Performing logical network functionality within data compute nodes

ABSTRACT

Some embodiments provide a method for a first managed forwarding element operating within a first data compute node (DCN) that executes on a host machine. From the first DCN, the method receives a packet destined for a second DCN that is logically connected to the first DCN through a set of logical forwarding elements of a logical network. The method performs forwarding processing on the packet in order to (i) identify a particular logical forwarding element in the set of logical forwarding elements, a logical port of which is coupled to the second DCN, and (ii) identify a second managed forwarding element that implements the logical port of the particular logical forwarding element. The method forwards the packet to the second managed forwarding element.

BACKGROUND

In today's Software-Defined Networking (SDN), a control plane implements and maintains the control logic that governs the forwarding behavior of shared network switching elements on a per user basis. For example, the logical network of a tenant of a hosting system connects a set of end machines (e.g., virtual machines, physical machines, etc.) that are assigned to the tenant, to each other and to other virtual and/or physical networks through a set of logical forwarding elements (e.g., logical switches, logical routers, etc.).

Conventionally, a virtualization software (e.g., a hypervisor) of each host machine implements different sets of logical forwarding elements that connect the end machines operating on the host machine to different logical networks. However, adding a layer to the virtualization software to implement the different logical networks imposes performance overhead on the virtualization software and lowers the overall efficiency of the host machine that executes the virtualization software. Additionally, the hypervisor does not have control (e.g., to enforce network policies) over the end machines that can have access to hardware directly (e.g., through the pass-through technology).

BRIEF SUMMARY

Some embodiments provide a managed forwarding element (MFE) within a data compute node (DCN) that operates on a host machine in order to enable the DCN to perform network functionalities (e.g., L2 switching, L3 routing, tunneling, etc.) that are normally performed by the virtualization software of the host machine. In some embodiments, the MFE in the data compute node (referred to as DCN-MFE hereinafter) performs these network functionalities instead of, or in conjunction with, a managed forwarding element that resides in the virtualization software (e.g., in the hypervisor) of the host machine.

In some embodiments, a local controller that operates on the host machine (e.g., in the hypervisor of the host machine) configures and manages a DCN-MFE within each DCN (e.g., virtual machine, physical machine, container, etc.) executing on the host machine. In some embodiments, the local controller receives the configuration and forwarding data required to configure and manage the DCN-MFEs from a central control plane (CCP) cluster. The CCP cluster of some embodiments includes one or more central controllers that configure and manage one or more logical networks for one or more tenants of a hosting system (e.g., a datacenter). In some embodiments, the CCP cluster (1) receives data that defines a logical network (e.g., from a user), (2) based on the received data, computes the configuration and forwarding data that define forwarding behaviors of a set of logical forwarding elements for the logical network, and (3) distributes the computed data to a set of local controllers operating on a set of host machines.

In some embodiments, each local controller resides on a host machine (e.g., in the virtualization software of the host machine) that executes one or more DCNs of the logical network. The DCNs of the logical network that execute on different host machines logically connect to each other, and to other physical or logical networks, through the set of logical forwarding elements (e.g., logical switches, logical routers, etc.). In some embodiments, each local controller, after receiving the logical network data from the CCP cluster, generates configuration and forwarding data that defines forwarding behaviors of (1) an MFE that resides on the same host machine alongside the local controller, and (2) each DCN-MFE of each DCN of the host machine that participates in the logical network. The local controller then distributes the generated data to the managed forwarding element (MFE) and the DCN-MFEs. Each of the MFE and DCN-MFEs implements the set of logical forwarding elements based on the configuration and forwarding data received from the local controller.

The configuration and forwarding data that the local controller of some embodiments generates for the MFE of the host machine, however, may be different from the configuration and forwarding data that the local controller generates for the DCN-MFEs of the same host machine. The MFE resides in the hypervisor of the host machine and is connected to several different DCNs, different subsets of which may belong to different logical networks of different tenants. As such, the MFE should be capable of implementing different sets of logical forwarding elements for different logical networks. On the other hand, each DCN-MFE that resides in a DCN (e.g., a virtual machine (VM)) is only capable of implementing the logical network to which the DCN is connected in some embodiments. Hence, in some embodiments, the forwarding and configuration data generated for the MFE of the host machine could be different (e.g., covering more logical networks' data) from the forwarding and configuration data generated for the DCN-MFEs of the DCNs.

Additionally, the forwarding and configuration data that a local controller of some embodiments generates for different DCNs that operate on the same host machine could be different from one DCN to another. That is, the logical network data generated for a particular DCN-MFE operating in a DCN of a host machine could be different from the logical network data generated for a DCN-MFE of a different DCN in the same host machine (e.g., when the two DCNs are connected to two different logical networks). In other words, in some embodiments, each DCN-MFE only implements a set of logical forwarding elements (e.g., logical switches, logical routers, etc.) of the logical network to which the DCN containing the DCN-MFE logically connects.
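The difference between the data generated for the hypervisor's MFE and the data generated for each DCN-MFE can be illustrated with a minimal Python sketch. It is not taken from any embodiment: the class names, the function names, and the idea of tagging each logical forwarding element with a logical network name are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalForwardingElement:
    """Hypothetical record for one logical switch or router."""
    name: str
    logical_network: str


@dataclass
class HostState:
    """Hypothetical view held by a local controller on one host."""
    lfes: list          # every LFE received from the CCP cluster
    dcn_networks: dict  # DCN name -> logical network it connects to


def config_for_hypervisor_mfe(state):
    # The hypervisor MFE serves every DCN on the host, so it gets the
    # LFEs of every logical network present on the host.
    networks = set(state.dcn_networks.values())
    return [lfe for lfe in state.lfes if lfe.logical_network in networks]


def config_for_dcn_mfe(state, dcn):
    # A DCN-MFE only gets the LFEs of its own DCN's logical network.
    network = state.dcn_networks[dcn]
    return [lfe for lfe in state.lfes if lfe.logical_network == network]
```

Under these assumptions, a host running one DCN per tenant would hand the hypervisor MFE the union of both tenants' logical forwarding elements, while each DCN-MFE would receive only its own tenant's subset.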

In some embodiments, the DCN-MFE of a DCN enables the DCN to perform network traffic forwarding processing in the DCN, instead of having the MFE operating in the virtualization software (e.g., the hypervisor) of the host machine perform the packet forwarding processing. In some embodiments, the DCN-MFE performs the packet forwarding processing for both outgoing and incoming network traffic. In some such embodiments, the data compute node offloads and receives the processed network traffic directly to and from a physical network interface controller (PNIC) of the host machine. That is, the DCN-MFE of the DCN exchanges the network traffic with the PNIC without communicating with a managed forwarding element that operates in the hypervisor (e.g., in the pass-through approach). In some embodiments, however, when a source DCN operating on a host machine needs to transmit the network traffic to a destination DCN that operates on the same host machine, the source and destination DCNs employ the hypervisor as an intermediary means for exchanging the network traffic.

In some embodiments, the DCN-MFE of the source DCN, after realizing that the destination DCN operates on the same host machine as the source DCN, offloads the packets destined for the destination DCN to a memory space of the host machine that is controlled by the virtualization software (e.g., hypervisor) and that is shared with the destination DCN. The source DCN-MFE then notifies the hypervisor of the offload. In some embodiments, after receiving the offload notification, the hypervisor notifies the destination DCN-MFE about the new network traffic (e.g., data packets) that is stored in the shared memory space. The destination DCN-MFE of some such embodiments reads the packets from the shared memory upon receiving the notification from the hypervisor.
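The offload-and-notify sequence described above can be sketched as a small Python simulation. This is only an illustration under stated assumptions: the shared memory space is modeled as an in-process queue, the notifications are modeled as direct method calls, and the class and method names (Hypervisor, DcnMfe, notify_offload, on_new_traffic) are hypothetical.

```python
from collections import deque


class Hypervisor:
    """Hypothetical hypervisor that owns the shared memory space."""

    def __init__(self):
        self.shared_pages = deque()  # stands in for the shared pages
        self.dcn_mfes = {}

    def register(self, mfe):
        self.dcn_mfes[mfe.name] = mfe

    def notify_offload(self, dst_name):
        # On an offload notification, tell the destination DCN-MFE
        # that new traffic is waiting in the shared memory space.
        self.dcn_mfes[dst_name].on_new_traffic()


class DcnMfe:
    """Hypothetical DCN-MFE running inside one DCN."""

    def __init__(self, name, hypervisor):
        self.name = name
        self.hypervisor = hypervisor
        self.received = []
        hypervisor.register(self)

    def send_local(self, dst_name, packet):
        # Step 1: offload the packet to the shared memory space.
        self.hypervisor.shared_pages.append((dst_name, packet))
        # Step 2: notify the hypervisor of the offload.
        self.hypervisor.notify_offload(dst_name)

    def on_new_traffic(self):
        # Step 3: read the packets addressed to this DCN and leave
        # any others in place.
        remaining = deque()
        while self.hypervisor.shared_pages:
            dst, packet = self.hypervisor.shared_pages.popleft()
            if dst == self.name:
                self.received.append(packet)
            else:
                remaining.append((dst, packet))
        self.hypervisor.shared_pages.extend(remaining)
```

In this sketch a call such as src.send_local("vm2", b"payload") leaves the payload in the destination DCN-MFE's received list without involving any NIC.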

The shared memory space includes one or more particular physical pages of a host machine's physical memory that the hypervisor of the host machine assigns as a shared memory space for the DCNs operating on the host machine in some embodiments. In some such embodiments, the hypervisor of the host machine assigns the physical page(s) as shared memory space between the DCNs by mapping the physical page(s) to one or more particular physical pages in each DCN that shares the memory space. In this manner, the same physical pages of the host machine's memory become available to two or more DCNs operating on the host machine for writing to and reading from these shared physical pages.

In some embodiments, when the source and destination DCNs that include DCN-MFEs operate on different host machines and exchange data through the PNICs of the host machines directly (i.e., use the pass-through approach), the source and destination DCN-MFEs use a particular tunnel protocol to exchange network traffic between each other. That is, the source DCN-MFE uses a particular tunnel protocol (e.g., VXLAN, STT, Geneve, etc.) to encapsulate the packets with the source and destination DCN-MFE addresses (e.g., IP addresses associated with the DCN-MFEs). These source and destination addresses inserted in outer tunnel headers of the packets (e.g., in the packet headers) are used as the tunnel source and destination endpoints, respectively. The source DCN-MFE then sends the encapsulated packets towards the destination DCN (through the physical NIC of the host machine and onto the physical network between the DCNs). The destination DCN-MFE (i.e., the destination endpoint) then decapsulates the packets (i.e., removes the tunneling information added to the packets) using the particular tunnel protocol and sends the packets towards their corresponding destination in the destination DCN.
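To make the encapsulation and decapsulation steps concrete, the following Python sketch adds and removes an 8-byte VXLAN header, using the layout defined in RFC 7348 (an 8-bit flags field with the VNI-valid bit set, 24 reserved bits, a 24-bit VXLAN network identifier, and 8 reserved bits). It is a simplified illustration only: the outer Ethernet, IP, and UDP headers that would carry the tunnel endpoint addresses are omitted, and the function names are hypothetical.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348
VXLAN_HEADER_LEN = 8


def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend a VXLAN header carrying the given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits).
    # Word 1: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(packet: bytes):
    """Strip the VXLAN header and return (vni, inner_frame)."""
    if len(packet) < VXLAN_HEADER_LEN:
        raise ValueError("packet shorter than a VXLAN header")
    word0, word1 = struct.unpack("!II", packet[:VXLAN_HEADER_LEN])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word1 >> 8, packet[VXLAN_HEADER_LEN:]
```

A source endpoint would call vxlan_encapsulate before handing the packet to the outer UDP/IP layer, and the destination endpoint would call vxlan_decapsulate to recover the original frame and the network identifier.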

In some embodiments, even though the forwarding processing of the network traffic is done by the DCN-MFE, the processed network traffic is still sent to the virtualization software (e.g., through a virtual network interface controller (VNIC) of the DCN) rather than the PNIC of the host machine (e.g., in the emulation approach). In some such embodiments, the MFE of the virtualization software does not perform any additional forwarding processing on the outgoing packets and merely hands the received packets to the PNIC of the host machine. In some embodiments, however, the MFE operating in the virtualization software of the host machine performs packet forwarding processing for the incoming network traffic. That is, the MFE of a host machine performs forwarding processing on the incoming packets that are destined for any of the DCNs operating on the host machine. In other words, in some embodiments, the outgoing traffic is processed by the DCN-MFEs of the DCNs executing on a host machine, while the incoming traffic is processed by the MFE that operates in the virtualization software of the host machine.

A reason for having the DCN-MFE process the outgoing traffic and the hypervisor's MFE process the incoming traffic is that some embodiments have the forwarding element that is closer to the source of the packets perform the packet processing in order to increase the network traffic efficiency. For example, during the processing of a packet, when the first forwarding element on the path of a packet determines that the packet should be dropped (based on a network policy), the first forwarding element drops the packet and does not send the packet to the second forwarding element on the path to make such a determination. As such, extra network resources are not deployed to continue forwarding a packet towards a destination when the packet will not reach the destination anyway.

In some embodiments, when the source and destination DCNs operate on different host machines and exchange network traffic through the MFEs of the virtualization software (i.e., in the emulation approach), the source DCN-MFE and the MFE of the destination host machine (on which the destination DCN operates) use a particular tunnel protocol to exchange the network traffic. That is, the source DCN-MFE uses a tunnel protocol (e.g., VXLAN, STT, Geneve, etc.) to encapsulate the packets with the source DCN-MFE and the destination MFE addresses (e.g., IP addresses) as the tunnel endpoints before sending the packets towards the destination DCN. The destination MFE (i.e., the MFE in the hypervisor of the destination host machine) then decapsulates the packets (i.e., removes the tunneling information added to the packets) using the same tunnel protocol, and sends the packets towards the destination DCN.

Since the DCN-MFE of some embodiments is instantiated (and operates) in a DCN (e.g., a VM that belongs to a tenant of a hosting system), the DCN-MFE is more vulnerable to malicious attacks in comparison with an MFE that is instantiated (and operates) in a hypervisor of a host machine. This is because, although the DCN-MFE is instantiated in the kernel of a guest operating system (e.g., in the network stack of the kernel), in some embodiments, the DCN-MFE is still exposed to other applications and processes that are run by the guest operating system. On the contrary, an MFE that operates in the hypervisor of a host machine is solely controlled by the central control plane (i.e., the CCP cluster) of the hosting system and is not exposed to any outside applications and/or processes.

In order to protect the DCN-MFE from malicious attacks, some embodiments mark the pages that contain the code and data of the DCN-MFE (e.g., the memory space of the host machine on which the DCN-MFE's code and data are loaded) as read-only to the guest operating system. Some such embodiments only allow the hypervisor to write on the pages that are marked as read-only for the guest operating system. Although this approach protects the DCN-MFE from being modified by the guest operating system, a malicious module may still attack the DCN-MFE by loading onto the guest kernel and simulating the functionalities of the DCN-MFE. That is, a malicious module could load onto the guest kernel and communicate with the VNIC (of the DCN) or the PNIC (of the host machine) in the same way that the DCN-MFE does, hence exposing these interfaces to malicious attacks. In addition to marking the memory as read-only memory, some embodiments check one or more particular data structures of the guest kernel (e.g., in the same manner as an antivirus program) to ensure that the DCN-MFE is the only module that communicates with the PNIC and/or VNIC (e.g., through a secure communication channel).

Some embodiments protect the DCN-MFE from malicious attacks that simulate the DCN-MFE functionalities by isolating the DCN-MFE from other modules and processes of the DCN in the host memory space. That is, some embodiments separate the memory space (e.g., in the host machine's physical memory), in which the code and data of the DCN-MFE are loaded (referred to as guest secure domain hereinafter) from the memory space, in which the other applications and processes of the DCN are loaded (referred to as guest general domain hereinafter). In some embodiments, the other applications and processes that are stored in the guest general domain include the guest user space applications, as well as the processes and modules that are loaded in the guest kernel. Some embodiments store additional data and modules in the guest secure domain, in which the DCN-MFE is loaded, in order for the two guest domains to be able to communicate with each other in a secure manner.

Conventionally, when a data compute node is loaded in a host machine (e.g., into the host machine's physical memory), the hypervisor of the host machine creates and uses a set of nested page tables (NPTs) to map the guest physical memory space of the DCN to a portion (i.e., a set of pages) of the host physical memory space. In order to separate the guest secure domain from the guest general domain, the hypervisor of some embodiments creates two sets of NPTs for each DCN that is loaded in the host machine (i.e., that starts operating on the host machine). In some such embodiments, the hypervisor creates a first set of NPTs (also referred to as secure NPTs) and a second set of NPTs (also referred to as general NPTs). The secure NPTs include a set of tables that maps the guest physical memory addresses that contain the DCN-MFE (code and data) to the guest secure domain. Similarly, the general NPTs include a set of tables that maps the guest physical memory addresses that contain other applications and processes to the guest general domain.
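The two sets of NPTs can be pictured with a toy Python model in which each set is a dictionary from guest page numbers to host frame numbers. This is a conceptual sketch only: real nested page tables are multi-level hardware structures, and the 4 KiB page size, the function names, and the use of an exception to represent a translation fault are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def build_npt_sets(guest_pages, dcn_mfe_pages, secure_frames, general_frames):
    """Return (secure_npt, general_npt) as dicts: guest page -> host frame.

    guest_pages    -- guest physical page numbers of the DCN
    dcn_mfe_pages  -- subset holding the DCN-MFE code and data
    secure_frames  -- host frames reserved for the guest secure domain
    general_frames -- host frames reserved for the guest general domain
    """
    secure_npt, general_npt = {}, {}
    secure_iter, general_iter = iter(secure_frames), iter(general_frames)
    for page in guest_pages:
        if page in dcn_mfe_pages:
            secure_npt[page] = next(secure_iter)
        else:
            general_npt[page] = next(general_iter)
    return secure_npt, general_npt


def translate(npt, guest_physical_address):
    """Translate a guest physical address using one set of tables."""
    page, offset = divmod(guest_physical_address, PAGE_SIZE)
    if page not in npt:
        # The page is not mapped in this domain's tables.
        raise PermissionError(f"guest page {page} not mapped in this domain")
    return npt[page] * PAGE_SIZE + offset
```

In this model, code running under the general tables cannot resolve a guest page that holds the DCN-MFE, because that page appears only in the secure tables.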

Instead of using a separate secure domain for the code and data of the DCN-MFE, some embodiments employ a counter check security agent in order to protect the DCN-MFE against malicious attacks. In some embodiments, the counter check security agent operates in the virtualization software of the host machine. The counter check security agent of some embodiments receives a message from the DCN-MFE to increase a local counter value by n (n being an integer greater than or equal to one). The counter check security agent receives this message when the DCN-MFE transmits n packets (1) to a PNIC of the host machine directly (e.g., in the pass-through approach), or alternatively (2) to a VNIC of the DCN to be transmitted to the MFE of the virtualization software (e.g., in the emulation approach).

The counter check security agent of some embodiments, after receiving the counter increase message from the DCN-MFE, retrieves a packet counter value from the PNIC of the host machine and/or the VNIC of the data compute node (depending on whether a pass-through approach or an emulation approach is in use). In some embodiments, the packet counter value shows the total number of packets received at the PNIC and/or VNIC. By comparing the local counter value (after increasing the local counter value by n) and the packet counter value received from the PNIC and/or VNIC, the counter check security agent is able to determine whether the DCN is under a malicious attack or not.
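The comparison performed by the counter check security agent can be sketched in Python as follows. The sketch rests on assumptions that the text above does not spell out: it treats any mismatch between the local counter and the NIC's packet counter as a sign of a possible attack, it ignores timing races between transmission and notification, and the class name and the read_nic_counter callback are hypothetical.

```python
class CounterCheckAgent:
    """Hypothetical counter check security agent."""

    def __init__(self, read_nic_counter):
        # read_nic_counter: callable returning the total number of
        # packets the PNIC or VNIC has received from the DCN.
        self.read_nic_counter = read_nic_counter
        self.local_counter = 0

    def on_counter_increase(self, n):
        """Handle a counter increase message for n transmitted packets.

        Returns True when the counters agree, and False when they do
        not (assumed here to indicate a possible attack).
        """
        if n < 1:
            raise ValueError("n must be an integer >= 1")
        self.local_counter += n
        return self.local_counter == self.read_nic_counter()
```

If a module other than the DCN-MFE injects packets without reporting them, the NIC counter runs ahead of the local counter and the next check returns False.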

However, a determined malicious module that simulates the DCN-MFE in the guest kernel may also imitate the communication between the DCN-MFE and the counter check security agent. In some embodiments, the DCN-MFE and counter check security agents communicate with each other through a communication channel, which is essentially a software function. The malicious module could call the same counter increase function that the DCN-MFE calls. By calling the same function, the malicious module would also send a counter increase message to the security agent of the hypervisor to increase the local counter by n, each time the malicious module transmits n packets to the PNIC and/or VNIC. In other words, the malicious module would imitate both functions of the DCN-MFE to transmit the packets out to the PNIC and/or VNIC, and to send a counter increase message to the counter check security agent with each transmission.

In order to protect the DCN-MFE against this type of malicious module, the hypervisor of some embodiments generates a list of valid return addresses, each of which indicates a valid return address of a subsequent instruction after the last instruction of the counter increase function is executed. That is, each return address in the list of valid return addresses contains a memory address that points to a subsequent instruction that has to be executed after the last instruction of the counter increase function (called by the DCN-MFE) is executed. Additionally, each time any module (e.g., a DCN-MFE or a malicious module) calls the counter increase function, the module stores the return address of the function in a call stack. The return address of the increase function is the address of a subsequent instruction that has to be executed after the counter increase function returns.

In some embodiments, each time a counter increase message is received, the counter check security agent checks the call stack maintained by the DCN, which contains the return address after the counter increase function is finished. In some other embodiments, a different security agent (other than the counter check security agent) that runs in the communication channel between the DCN-MFE and the hypervisor (inside the hypervisor or the DCN) checks the call stack. The security agent then matches the return address in the call stack of the DCN against the list of valid return addresses (that are kept in a local storage of the hypervisor or DCN). When no match is found, the security agent determines that a separate module (which has a different return address for the subsequent instruction) has called into the counter increase function and notifies the virtualization software of a potential malicious attack on the DCN.
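The return address check described above can be sketched in Python as follows. The sketch models the call stack as a plain list whose last entry is the return address pushed by the caller of the counter increase function, and it records alerts in a list in place of notifying the virtualization software. The class name, the method name, and the example addresses are hypothetical.

```python
class ReturnAddressChecker:
    """Hypothetical security agent that validates callers."""

    def __init__(self, valid_return_addresses):
        # The list of valid return addresses generated by the hypervisor.
        self.valid = frozenset(valid_return_addresses)
        self.alerts = []  # stands in for notifying the hypervisor

    def on_counter_increase(self, call_stack):
        """Check the caller of the counter increase function.

        call_stack -- list of addresses; the last entry is assumed to
                      be the return address of the current call.
        Returns True for a recognized caller, False otherwise.
        """
        if not call_stack:
            raise ValueError("call stack is empty")
        return_address = call_stack[-1]
        if return_address in self.valid:
            return True
        # No match: some other module called the function.
        self.alerts.append(return_address)
        return False
```

A call made from the DCN-MFE carries one of the known return addresses and passes the check, while a call from any other module carries an unknown address and is recorded as a potential attack.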

The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all of the inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description and the Drawings, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 illustrates an example implementation of DCN-MFEs within different virtual machines that reside in different host machines.

FIG. 2 illustrates a local controller of some embodiments that along with a managed forwarding element, operates in the hypervisor of a host machine and configures and manages both of the managed forwarding element and the DCN-MFEs of the host machine.

FIG. 3 conceptually illustrates a process of some embodiments that determines which forwarding element operating on a host machine should perform forwarding processing for network traffic data that is generated in a DCN.

FIG. 4 illustrates an example of establishing tunnels between DCN-MFEs of different data compute nodes that operate on different host machines.

FIG. 5 illustrates two different DCNs of two different host machines communicating with each other directly through the physical network interface controllers (PNICs) of the host machines using a particular tunnel protocol (e.g., VXLAN).

FIG. 6 illustrates two different DCNs of two different host machines communicating with each other through the hypervisors (i.e., MFEs implemented in the hypervisors) of the host machines using a particular tunnel protocol (e.g., VXLAN).

FIG. 7 illustrates an example of a DCN-MFE within a data compute node that performs packet forwarding processing on a packet received from an application and transmits the processed packet directly to a PNIC of a host machine that hosts the data compute node.

FIG. 8 illustrates another example of a DCN-MFE within a data compute node that receives a packet directly from a PNIC of a host machine that hosts the data compute node, and performs the necessary packet forwarding processing on the received packet.

FIG. 9 illustrates an MFE residing in the virtualization software of a host machine that handles the outgoing traffic in the emulation approach.

FIG. 10 illustrates an MFE residing in the virtualization software of a host machine that handles the incoming traffic in the emulation approach.

FIG. 11 conceptually illustrates a process of some embodiments for employing the virtualization software of a host machine in order to exchange network data between two data compute nodes of the same host machine in the pass-through approach.

FIG. 12 illustrates two different ways of forwarding data from a source DCN-MFE based on the destination DCN-MFE being on the same host machine or a different host machine.

FIG. 13 illustrates a more detailed example of exchanging network data between a source DCN-MFE and a destination DCN-MFE in a pass-through approach, when the virtual machines containing the source and destination DCN-MFEs operate on the same host machine.

FIG. 14 conceptually illustrates a process that some embodiments perform in order to isolate a guest secure domain in the physical memory of a host machine for loading the code and data of a DCN-MFE of a data compute node.

FIG. 15 illustrates a memory mapping system that some embodiments employ to isolate a guest secure domain from the guest general domain in the physical memory of the host machine.

FIG. 16 conceptually illustrates a process that some embodiments perform to protect a DCN-MFE of a data compute node against malicious attacks by using a packet counter value.

FIG. 17 illustrates an example of a counter check security agent operating in a hypervisor of a host machine that protects a DCN-MFE against a malicious attack in the pass-through approach.

FIG. 18 illustrates another example of a counter check security agent operating in a hypervisor of a host machine that protects a DCN-MFE against a malicious attack in the emulation approach.

FIG. 19 conceptually illustrates a process of some embodiments that protects a DCN-MFE of a data compute node against a malicious module that imitates the DCN-MFE in sending counter increase messages to the counter check security agent.

FIG. 20 illustrates an example of a security agent operating in a hypervisor of a host machine along with a counter check security agent in order to protect a DCN-MFE against a malicious attack in the pass-through approach.

FIG. 21 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it should be understood that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.

Some embodiments provide a managed forwarding element (MFE) within a data compute node (DCN) that operates on a host machine in order to enable the DCN to perform network functionalities (e.g., L2 switching, L3 routing, tunneling, teaming ports, link aggregation, etc.) that are normally performed by the virtualization software of the host machine. In some embodiments, the MFE in the data compute node (i.e., the DCN-MFE) performs these network functionalities instead of, or in conjunction with, a managed forwarding element that resides in the virtualization software (e.g., in the hypervisor) of the host machine. The managed forwarding element, in some embodiments, is a software instance that is instantiated in the hypervisor of a host machine to perform network traffic forwarding processing for the packets that originate from and/or are destined for a set of DCNs (e.g., virtual machines) that reside on the host machine.

In some embodiments, one or more central controllers in a central control plane (CCP) cluster configure and manage one or more logical networks for one or more tenants of a hosting system (e.g., a datacenter). In some embodiments, a logical network of the hosting system logically connects different data compute nodes (e.g., end machines such as virtual machines (VMs), physical servers, containers, etc.) through a set of logical forwarding elements (e.g., logical L2 switches and logical L3 routers). Some of the end machines (e.g., the virtual machines, containers, etc.) reside on host machines that execute managed forwarding elements (MFEs), which implement the logical forwarding elements of the logical network to which the end machines are logically connected. In other words, each of the host machines executes an MFE that processes packets sent to and received from the end machines residing on the host machine, and exchanges these packets with other hardware and software managed forwarding elements (e.g., through tunnels).

FIG. 1 illustrates an example implementation of DCN-MFEs within different virtual machines that reside in different host machines. More specifically, this figure shows the communication channels between a central control plane, the host machines, and the different elements of the host machines for exchanging configuration data and network traffic between the different elements. FIG. 1 includes a central control plane (CCP) cluster 105 and a set of host machines 120. The CCP cluster 105 includes several central controllers 110, while each host machine 120 includes a set of VMs 130, a hypervisor 135, and a physical network interface controller (PNIC) 150. Each of the shown VMs 130 that operates on one of the host machines 120 includes a data compute node managed forwarding element (DCN-MFE) 140.

One of ordinary skill in the art will realize that a CCP cluster or a host machine of some embodiments includes many more elements and modules that are not shown in this figure for the simplicity of description. The illustrated figure only shows some of the modules and elements of the CCP cluster and the host machines that are more relevant to the embodiments that are described above and below. Each one of the elements shown in the figure is now described in more detail below.

The illustrated central controllers 110 of some embodiments are responsible for (i) receiving definitions of different logical networks for different tenants of a hosting system (e.g., from a network administrator), and (ii) distributing the logical configuration and forwarding data to the managed forwarding elements (not shown) that reside in the hypervisors 135, and to the DCN-MFEs 140 that reside in the VMs 130. The MFEs of the hypervisors and DCN-MFEs use the distributed data to implement different logical switches (e.g., logical L2 switches, logical L3 switches, etc.) of the different logical networks in order to logically connect the virtual machines together and to other physical and logical networks (e.g., an external physical network, a third party hardware switch, a logical network of a tenant of another hosting system, etc.).

A virtualization software such as the hypervisor 135, in some embodiments, executes in the host machine 120 and is responsible for creating and managing a set of virtual machines 130 in the host machine. In some embodiments, a hypervisor (i.e., an MFE in the hypervisor) performs various different network functionalities for the data compute nodes (VMs) that run on the host machine. For example, after a hypervisor creates a set of VMs (e.g., for a tenant of a datacenter) on one or more host machines, the MFEs running inside the hypervisors of the host machines implement a set of logical forwarding elements that connects the different VMs of the tenant to each other and to other networks. In some embodiments, the MFEs exchange the logical network traffic for the tenant by establishing tunnels between each other (i.e., encapsulating the packets using a tunnel protocol) and serving as tunnel endpoints to exchange the network data.

However, having the DCNs of a host machine transmit the network traffic to the hypervisor (e.g., through a virtual network interface controller (VNIC)) in order for the hypervisor to enforce network policies and to perform packet forwarding for the DCNs (as in the emulation approach) adds an extra layer of software to the hypervisor and imposes performance overhead. Conversely, having the VMs of a host machine communicate directly with the PNIC (or a simulated PNIC) of the host machine (as in the pass-through approach) avoids much of the emulation approach's overhead. However, in this approach (i.e., pass-through), the control plane of the network has little or no control over the packets that are generated by or destined for the DCNs, since the DCNs do not employ the hypervisor for exchanging data with the physical network interface.

In some embodiments, the DCN-MFE 140 is a software instance (e.g., a driver) that is instantiated inside the data compute node (e.g., in the network stack of the DCN's kernel). The DCN-MFE 140 enables the VM to perform packet forwarding processing and enforce network policies on the packets inside the VM. This way, the VM 130 can generate data packets and transmit the packets directly to the PNIC 150, or receive data packets that are destined for the VM directly from the PNIC 150, without imposing any overhead on the hypervisor 135. Instead of, or in conjunction with, communicating directly with the PNIC 150, the DCN-MFE 140 of some embodiments performs the packet forwarding processing inside the VM and still sends the processed packets to the hypervisor (e.g., to an MFE residing in the hypervisor). In some such embodiments, since the packets are already processed by the VM, the hypervisor merely hands the packets to the PNIC 150 and, as such, the performance impact on the hypervisor is minimal. Moreover, since the DCN-MFE 140 is configured by the CCP cluster 105, the packet forwarding processing is controlled by the control plane.

In some embodiments, the DCN-MFE 140 is configured (e.g., by the CCP cluster) in such a way as to be able to switch between the pass-through approach (i.e., direct communication with the PNIC) and the emulation approach (i.e., communicating with the hypervisor through a VNIC). In some such embodiments, when the PNIC 150 is available, the DCN-MFE 140 utilizes the PNIC for transmitting network traffic. On the other hand, when the PNIC becomes unavailable to the DCN-MFE for any reason (e.g., the PNIC is reassigned to another VM through dynamic reconfiguration or reallocation of physical resources), the DCN-MFE 140 utilizes a VNIC (not shown in this figure) and exchanges the network traffic with the hypervisor 135 through the VNIC.
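
The switch between the two approaches can be pictured with the following minimal Python sketch. It is illustrative only: the `EgressState` type, the `select_egress` function, and the string labels are hypothetical names that do not appear in the described embodiments, and the sketch assumes PNIC availability is already known to the DCN-MFE as a simple flag.

```python
from dataclasses import dataclass


@dataclass
class EgressState:
    """Hypothetical view of what the DCN-MFE knows about its interfaces."""
    pnic_available: bool  # True while the PNIC is assigned to this DCN


def select_egress(state: EgressState) -> str:
    """Pick the transmit path for the next packet.

    Returns "pnic" (pass-through approach) when the PNIC is available,
    and "vnic" (emulation approach, via the hypervisor) otherwise.
    """
    return "pnic" if state.pnic_available else "vnic"


if __name__ == "__main__":
    state = EgressState(pnic_available=True)
    print(select_egress(state))   # pass-through path
    state.pnic_available = False  # e.g., PNIC reassigned to another VM
    print(select_egress(state))   # falls back to the VNIC
```

In this sketch the decision is re-evaluated per packet, so a reassignment of the PNIC takes effect on the next transmission; an actual embodiment may handle the transition differently.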

The above introduced the general concepts and implementation of DCN-MFEs in some embodiments, as well as certain aspects of the forwarding processing by the DCN-MFEs within the data compute nodes. In the following, Section I describes how a DCN-MFE of some embodiments performs network functionalities that are conventionally assigned to a hypervisor of a host machine instead of, or in conjunction with, the hypervisor. Next, Section II describes securing a DCN-MFE that is instantiated in a data compute node against any potential malicious attacks. Section III then describes the electronic system with which some embodiments of the invention are implemented.

I. Packet Forwarding Processing within a DCN

In some embodiments, a local controller that operates on the host machine (e.g., in the hypervisor of the host machine) configures and manages a DCN-MFE within each DCN (e.g., virtual machine, physical machine, container, etc.) executing on the host machine. In some embodiments, the local controller receives the configuration and forwarding data required to configure and manage the DCN-MFEs from a central control plane (CCP) cluster.

As described above, the CCP cluster of some embodiments includes one or more central controllers that configure and manage one or more logical networks for one or more tenants of a hosting system (e.g., a datacenter). In some embodiments, the CCP cluster (1) receives data that defines a logical network (e.g., from a user), (2) based on the received data, computes the configuration and forwarding data that define forwarding behaviors of a set of logical forwarding elements for the logical network, and (3) distributes the computed data to a set of local controllers operating on a set of host machines.

In some embodiments, each local controller, along with a managed forwarding element, resides on a host machine (e.g., in the virtualization software of the host machine) that executes one or more DCNs of the logical network. The DCNs of the logical network that execute on different host machines logically connect to each other (and to other physical or logical networks) through the set of logical forwarding elements (e.g., logical switches, logical routers, etc.).

In some embodiments, each local controller, after receiving the logical network data from the CCP cluster, generates configuration and forwarding data that defines forwarding behaviors of (1) the MFE that resides on the same host machine alongside the local controller, and (2) each DCN-MFE of each DCN of the host machine that participates in the logical network. The local controller then distributes the generated data to the MFE and the DCN-MFEs. Each of the MFE and DCN-MFEs implements the set of logical forwarding elements based on the configuration and forwarding data received from the local controller.

The configuration and forwarding data that the local controller of some embodiments generates for the MFE of the host machine, however, may be different from the configuration and forwarding data that the local controller generates for the DCN-MFEs of the same host machine. The MFE resides in the hypervisor of the host machine and is connected to several different DCNs, different subsets of which may belong to different logical networks of different tenants. As such, the MFE should be capable of implementing different sets of logical forwarding elements for different logical networks. On the other hand, each DCN-MFE that resides in a DCN (e.g., a virtual machine or VM) is, in some embodiments, capable of implementing only the one or more logical networks to which the DCN is connected (e.g., the logical network(s) that are accessible to a tenant). Hence, in some embodiments, the forwarding and configuration data generated for the MFE of the host machine could be different (e.g., covering more logical networks' data) from the forwarding and configuration data generated for each DCN-MFE that operates on the same host machine.

Additionally, the forwarding and configuration data that a local controller of some embodiments generates for different DCNs that operate on the same host machine could be different from one DCN to another. As described above, in some embodiments, each DCN-MFE only implements a set of logical forwarding elements (e.g., logical switches, logical routers, etc.) of the logical network to which the DCN containing the DCN-MFE logically connects. As such, the logical network data generated for a particular DCN-MFE operating in a DCN of a host machine could be different from the logical network data generated for a DCN-MFE of a different DCN in the same host machine (e.g., when the two DCNs are connected to two different logical networks).
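
As a rough illustration of this per-element customization, the following Python sketch derives one data set for the hypervisor MFE (covering every logical network present on the host) and a narrower data set for each DCN-MFE (covering only the networks of its own DCN). The function name `customize_for_host`, the dictionary layout, and the network identifiers are all assumptions made for the example; the embodiments do not prescribe a data format.

```python
def customize_for_host(logical_data, dcn_networks):
    """Split CCP-provided logical network data for one host machine.

    logical_data: dict mapping a logical network id to its forwarding data.
    dcn_networks: dict mapping a DCN name to the set of logical network ids
                  that the DCN is connected to.

    Returns (mfe_data, dcn_mfe_data), where mfe_data covers all logical
    networks used on the host and dcn_mfe_data[dcn] covers only the
    networks of that DCN.
    """
    # The hypervisor MFE serves every DCN on the host.
    host_networks = set().union(*dcn_networks.values())
    mfe_data = {net: data for net, data in logical_data.items()
                if net in host_networks}

    # Each DCN-MFE only receives data for its own DCN's networks.
    dcn_mfe_data = {
        dcn: {net: logical_data[net] for net in nets if net in logical_data}
        for dcn, nets in dcn_networks.items()
    }
    return mfe_data, dcn_mfe_data


if __name__ == "__main__":
    logical_data = {"ln-a": {"ports": 3}, "ln-b": {"ports": 2}, "ln-c": {"ports": 5}}
    dcn_networks = {"VM1": {"ln-a"}, "VM2": {"ln-b"}}
    mfe, per_dcn = customize_for_host(logical_data, dcn_networks)
    print(sorted(mfe))             # networks the hypervisor MFE implements
    print(sorted(per_dcn["VM1"]))  # networks the DCN-MFE of VM1 implements
```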

FIG. 2 illustrates a local controller of some embodiments that, along with a managed forwarding element, operates in the hypervisor of a host machine and configures and manages both the managed forwarding element and the DCN-MFEs of the host machine. This figure includes the same CCP cluster 105 and host machines 120 that were shown in FIG. 1. Additionally, FIG. 2 shows that each hypervisor 135 includes a local controller 210 and a managed forwarding element (MFE) 220. Furthermore, except for the DCN 230 (VM4) of Host Machine 2, each of the other three DCNs 130 (VM1-3) includes a DCN-MFE 140.

The virtual machines VM1-4 communicate with each other and other network entities (e.g., third-party hardware switches) via one or more logical networks, to which they are connected through the MFEs 220 and the DCN-MFEs 140. One of ordinary skill in the art would realize that the number of host machines and DCNs illustrated in the figure is exemplary and chosen only to simplify the description. In practice, a logical network for a tenant of a hosting system may span a multitude of host machines (and other third-party physical devices), and logically connect a large number of end machines to each other (and to other third-party physical devices).

The CCP cluster 105 communicates with the MFEs 220 and the DCN-MFEs 140 through the local controllers 210 in order to configure and manage these forwarding elements. The MFEs 220 and DCN-MFEs 140, in turn, implement different logical forwarding elements of the logical networks to logically connect the DCNs 130 and 230 operating on the host machines 120 to each other, to other end machines operating on other host machines (not shown), and to other physical machines that are connected to other third-party physical switches (not shown).

In some embodiments, the local controller 210 of each hypervisor of the host machines receives logical network data from a central controller 110 of the controller cluster 105. The controller 210 then converts and customizes the received logical network data for the local physical forwarding elements (i.e., the MFE 220 and DCN-MFEs 140 that operate on the same machine on which the local controller 210 operates). As described above, though, the customized data generated for an MFE may be different from the customized data generated for the DCN-MFEs. Similarly, the customized data generated for a DCN-MFE may be different from the customized data generated for another DCN-MFE (operating in the same or a different host machine). The local controller then delivers the converted and customized data to the local physical forwarding elements 220 and 140 on each host machine 120.

The CCP cluster of some embodiments communicates with the local controllers using a particular protocol (e.g., a Virtual Extensible LAN (VXLAN) control plane protocol), in order to distribute the logical configuration and forwarding data to the local controllers. In some embodiments, the local controllers use the same or a different protocol (e.g., the OpenFlow protocol) to distribute the converted logical forwarding data to the MFEs of the host machines, as well as the DCN-MFEs of the DCNs running on the host machines. The local controllers of some embodiments use the same or a different protocol (e.g., a database protocol such as the OVSDB protocol) to manage the other configurations of the forwarding elements, including the configuration of tunnels to other forwarding elements (MFEs and DCN-MFEs). In some other embodiments, the local controllers 210 use other protocols to distribute the forwarding and configuration data to the forwarding elements (e.g., a single protocol for all of the data or different protocols for different data).

In some embodiments, each of the DCN-MFEs 140 on a host machine implements a particular set of logical forwarding elements (LFEs) for a particular logical network that logically connects the DCN containing the DCN-MFE to other network elements. An MFE 220 of a host machine, on the other hand, implements different sets of LFEs for different logical networks to which the different DCNs operating on the host machine are connected. As will be described in more detail below, the DCN-MFEs 140 of each host machine perform the logical forwarding operations (e.g., packet forwarding processing based on the forwarding information of the implemented LFEs) instead of, or in conjunction with, the MFE 220 that operates on the same host machine. As stated above, each set of logical forwarding elements (not shown) connects one or more of the end machines that reside on the same host machine to each other and to other end machines that are connected to the logical network. The logically connected end machines of the host machine, together with other logically connected end machines (operating on other host machines or connected to other hardware switches), create a logical network topology for the tenant of the hosting system.

In some embodiments, the connections of the end machines to the logical switch (as well as the connections of the logical switch to other logical switches such as a logical router) are defined using logical ports, which are mapped to the physical ports of the physical forwarding elements (MFEs and DCN-MFEs). As described above, in some embodiments, the LFEs (e.g., logical routers and switches) of a logical network are implemented by each DCN-MFE of each DCN that is connected to the logical network. That is, in some embodiments, when the DCN-MFE receives a packet from the DCN (i.e., from an application that runs in the DCN), the DCN-MFE performs the network processing for the logical switch to which the DCN logically couples, as well as the processing for any additional LFE (e.g., logical router processing if the packet is sent to an external network, logical router processing and processing for the other logical switch in the network if the packet is sent to an end machine (DCN) coupled to the other logical switch, etc.).

In some embodiments, the DCN-MFEs implement the LFEs of the logical network through a set of flow entries. These flow entries are generated by a local controller operating on each host machine (such as the local controllers 210). The local controller generates the flow entries by receiving the logical forwarding data from the CCP cluster and converting the logical forwarding data to the flow entries for routing the packets of the logical network in the host machine. That is, the local controller converts the logical forwarding data to a customized set of forwarding behaviors that is recognizable and used by the DCN-MFEs to forward the packets of the logical network between the end machines. In other words, by using the generated flow entries, the DCN-MFEs are able to forward and route packets between data compute nodes of the logical network that contain the DCN-MFEs.

In some embodiments, however, some or all of the DCN-MFEs are not flow-based software forwarding elements, but instead process packets based on configuration data that is generated by their respective local controllers. In some embodiments, the local controllers receive the same data from the CCP cluster irrespective of the type of DCN-MFEs they manage, and perform different data conversions for different types of DCN-MFEs. Although in the described examples each logical network is assigned to a particular tenant, one tenant may have more than one logical network assigned to it. Also, because the end machines operating on a particular host machine may belong to more than one logical network (e.g., some of the end machines belong to a first tenant while the other end machines belong to a second tenant), each MFE of the host machine (i.e., operating in the virtualization software of the host machine) implements different sets of logical forwarding elements that belong to different logical networks.

Lastly, not all the DCNs executing on a host machine are required to implement a DCN-MFE in some embodiments. As illustrated in FIG. 2, the virtual machine 230 (VM4) does not include any DCN-MFE. In other words, in some embodiments, some of the end machines (DCNs) do not execute packet forwarding pipelines (i.e., do not implement LFEs, tunneling, etc.) within the end machine and instead forward the packets to an associated MFE executing in the hypervisor to perform the necessary packet forwarding processing. In some such embodiments, an end machine that does not implement a DCN-MFE couples to the logical networks through its associated MFE (running in the hypervisor).

A DCN-MFE of some embodiments determines whether the DCN-MFE can perform packet forwarding processing on a packet when the DCN-MFE receives the packet from one of the applications that executes on the DCN. That is, based on the configuration and forwarding data that the DCN-MFE has received from an associated local controller that executes on the same host machine, the DCN-MFE knows to which particular logical network(s) the DCN is connected. The DCN-MFE, therefore, implements only the logical forwarding elements of the particular logical network(s).

In some such embodiments, when a DCN-MFE receives a packet that belongs to a different logical network, the DCN-MFE does not perform forwarding processing and instead sends the packet to the MFE of the host machine to perform the forwarding processing on the packet. That is, when the DCN-MFE extracts the forwarding data (e.g., source and destination addresses) from the different network layers' headers of the packet (e.g., L2 header, L3 header, etc.), the DCN-MFE can determine whether the packet is destined for a logical network for which the DCN-MFE is configured, or a different logical network. When the DCN-MFE determines that the packet is destined for a DCN that is connected to a logical network that the DCN-MFE does not implement, the DCN-MFE forwards the packet to the MFE of the host machine for forwarding processing.

FIG. 3 conceptually illustrates a process 300 of some embodiments that determines which forwarding element operating on a host machine should perform forwarding processing for network traffic data that is generated in a DCN. More specifically, the process determines whether the DCN-MFE is capable of performing the forwarding processing on a packet that the DCN-MFE receives from an application, or whether the packet should be sent to the MFE of the host machine to be processed. The process 300 of some embodiments is performed by the DCN-MFE that runs in a DCN (e.g., a virtual machine (VM), a container, etc.) operating on a host machine.

The process starts by receiving (at 310) a packet from one of the applications that executes in the DCN. In some embodiments, the DCN-MFE is a driver that operates in the kernel of the DCN, while the applications that generate the packets to be forwarded to other network elements operate in the user space of the DCN. Before starting any forwarding processing on the received packet, the process identifies (at 320) the destination logical network to which the destination DCN is connected. The destination DCN is the final destination data compute node that executes the destination application to which the packet is sent. In order to identify the destination logical network, the process of some embodiments extracts a set of forwarding data (e.g., source and destination addresses) from the different network layers' headers of the received packet (e.g., L2 header, L3 header, etc.). Based on this extracted data, the process can determine the destination address of the packet (e.g., based on the destination IP address stored in the L3 header of the packet).

The process then determines (at 330) whether the destination network is one of the logical networks that the DCN-MFE implements. The process of some embodiments determines that the DCN-MFE implements the destination logical network when the DCN-MFE has received the necessary configuration and forwarding data to implement the destination logical network. For example, the process can determine that the packet should be sent to an end machine in a different logical network when none of the logical forwarding elements that the DCN-MFE implements (for one or more logical networks) has a logical port associated with the destination address. In other words, when the configuration data that the DCN-MFE has received does not include information for a logical port of a logical switch to which the destination DCN couples, the DCN-MFE determines, in some embodiments, that the packet is destined for a different logical network. On the other hand, some embodiments ensure that the DCN-MFE is always configured to implement any logical networks reachable from its DCN without requiring that the packet pass through a centralized MFE.

When the process determines that the DCN-MFE does not implement the destination logical network, the process transmits (at 340) the packet to an MFE operating in the hypervisor of the host machine so that the MFE performs the necessary forwarding processing on the packet. The process then ends. As described above, the MFE is connected to several different DCNs operating on the host machine, different subsets of which may belong to different logical networks of different tenants. As such, the forwarding and configuration data that the MFE of the host machine receives includes data for all of these logical networks and not just the logical network(s) to which the source DCN couples. Based on this configuration and forwarding data, the MFE is capable of implementing different sets of logical forwarding elements for the different logical networks.

On the other hand, when the process determines that the destination logical network is one of the logical networks that the DCN-MFE implements, the process starts (at 350) performing forwarding processing on the received packet. The forwarding processing of network data is described in more detail below. The process then ends.
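
The decision made by the process 300 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the packet is modeled as a dictionary with a `dst_ip` key, the DCN-MFE's configuration is modeled as a per-network table mapping destination addresses to logical ports, and the names `process_300`, `port_tables`, and the returned labels are hypothetical.

```python
def process_300(packet, port_tables):
    """Decide which forwarding element should process a packet.

    packet:      dict with at least a "dst_ip" key (taken from the L3 header).
    port_tables: dict mapping each logical network id implemented by this
                 DCN-MFE to a dict of destination address -> logical port.

    Returns ("dcn-mfe", network_id, logical_port) when the DCN-MFE implements
    the destination logical network (operation 350), or
    ("hypervisor-mfe", None, None) when the packet must be handed to the MFE
    in the hypervisor (operation 340).
    """
    # Operation 320: extract forwarding data from the packet headers.
    dst = packet["dst_ip"]

    # Operation 330: does any implemented LFE have a logical port for dst?
    for network_id, ports in port_tables.items():
        if dst in ports:
            return ("dcn-mfe", network_id, ports[dst])

    # No logical port is known: treat as a different logical network.
    return ("hypervisor-mfe", None, None)


if __name__ == "__main__":
    tables = {"ln-a": {"10.0.0.2": "lport-2", "10.0.0.3": "lport-3"}}
    print(process_300({"dst_ip": "10.0.0.3"}, tables))
    print(process_300({"dst_ip": "192.168.7.9"}, tables))
```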

The specific operations of the process 300 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process 300 could be implemented using several sub-processes, or as part of a larger macro process.

After receiving the logical network configuration data from the local controllers, the MFEs and DCN-MFEs establish tunnels (e.g., a Virtual Extensible LAN (VXLAN) tunnel, a Stateless Transport Tunneling (STT) tunnel, a Geneve tunnel, etc.) between themselves (e.g., a full mesh of tunnels between all of the configured forwarding elements that implement the logical network) in order to exchange the logical network packets between the end machines that are coupled to the MFEs and/or the DCN-MFEs.

In some embodiments, when the source and destination DCNs operate on different host machines and exchange data through the PNICs of the host machines directly (i.e., use the pass-through approach), the source and destination DCN-MFEs use a particular tunnel protocol (e.g., VXLAN) to exchange network traffic between each other. That is, the source DCN-MFE uses a particular tunnel protocol to encapsulate the packets with the source and destination DCN-MFE addresses (e.g., IP addresses associated with the DCN-MFEs). These source and destination addresses inserted in the packets (e.g., in the packet headers) are used as the tunnel source and destination endpoints, respectively. The source DCN-MFE then sends the encapsulated packets towards the destination DCN. The destination DCN-MFE (i.e., the destination endpoint) then decapsulates the packets (i.e., removes the tunneling information added to the packets) using the particular tunnel protocol and sends the packets towards their corresponding destination in the destination DCN.

In some embodiments, when the source and destination DCNs operate on two different host machines and exchange network traffic through the MFEs of the virtualization software (i.e., use the emulation approach), the source DCN-MFE and the MFE of the destination host machine (on which the destination DCN operates) use a particular tunnel protocol to exchange the network traffic. That is, the source DCN-MFE uses a tunnel protocol (e.g., a VXLAN tunnel protocol) to encapsulate the packets with the source DCN-MFE and the destination MFE addresses (e.g., IP addresses) as the tunnel endpoints before sending the packets towards the destination DCN. The destination MFE (i.e., the MFE in the hypervisor of the destination host machine) then decapsulates the packets (i.e., removes the tunneling information added to the packets) using the same tunnel protocol, and sends the packets towards the destination DCN.
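
The difference between the two cases lies in which element acts as the destination tunnel endpoint. The short Python sketch below captures only that choice; the function name `tunnel_endpoints`, the mode strings, and the example IP addresses are assumptions introduced for illustration.

```python
def tunnel_endpoints(mode, src_dcn_mfe_ip, dst_dcn_mfe_ip, dst_host_mfe_ip):
    """Return the (source, destination) tunnel endpoint addresses.

    mode: "pass-through" -> destination endpoint is the destination DCN-MFE.
          "emulation"    -> destination endpoint is the MFE in the hypervisor
                            of the destination host machine.
    In both cases the source endpoint is the source DCN-MFE.
    """
    if mode == "pass-through":
        return (src_dcn_mfe_ip, dst_dcn_mfe_ip)
    if mode == "emulation":
        return (src_dcn_mfe_ip, dst_host_mfe_ip)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Example addresses are placeholders, not values from the figures.
    print(tunnel_endpoints("pass-through", "10.1.0.1", "10.2.0.1", "10.2.0.254"))
    print(tunnel_endpoints("emulation", "10.1.0.1", "10.2.0.1", "10.2.0.254"))
```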

FIG. 4 illustrates an example of establishing tunnels between DCN-MFEs of different data compute nodes that operate on different host machines. Specifically, this figure shows a portion of a logical network 400 in the top half of the figure, and a portion of a physical network 410 that implements the logical network portion 400. The logical network portion 400 includes a logical switch (e.g., an L2 switch) 415 that connects three different data compute nodes (VM1-3) 420-430 to each other. These VMs may belong to a tenant of a hosting system. The physical network portion 410 includes three host machines 440 (Host Machines 1-3), each of which executes a hypervisor 445 and one of the VMs 420-430. Three different DCN-MFEs 450-460 are implemented on the virtual machines VM1-3, respectively.

As described above, each local controller converts the logical network data that defines forwarding behaviors of a set of logical forwarding elements (LFEs) to customized forwarding data that defines forwarding behavior of each forwarding element (MFE and DCN-MFE) that implements the set of LFEs. The customized forwarding data, in some embodiments, includes, but is not limited to, (1) data (e.g., L2 data such as MAC address resolution protocol (ARP) tables, L3 data such as routing tables, etc.) for the MFE or DCN-MFE to implement the required set of LFEs for packets sent to and received from the DCNs, and (2) data (e.g., virtual tunnel endpoint (VTEP) tables, etc.) to encapsulate these packets using a tunnel protocol in order to send the packets to other MFEs and/or DCN-MFEs.

The logical data for implementing the LFEs of some embodiments includes tables that map addresses to logical ports of the LFEs (e.g., mapping MAC addresses of virtual machines 420-430 to logical ports of logical switch 415, mapping IP subnets to ports of logical routers (not shown), etc.), routing tables for logical routers, etc. In addition, the logical data includes mappings of the logical ports to physical ports of the MFEs or DCN-MFEs at which the machines connected to a logical port are located. In some embodiments in which the DCN-MFEs are flow-based software forwarding elements, the local controller converts the received logical network data into a set of flow entries that specify expressions to match against the header of a packet, and actions to take on the packet when a given expression is satisfied. Possible actions in some embodiments include modifying a packet, dropping a packet, sending it to a given egress port on the logical network, and writing in-memory metadata (analogous to registers on a physical switch) associated with the packet and resubmitting the packet back to the logical network for further processing.
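
A toy match-action pipeline in Python may help make the flow-entry model concrete. The sketch below is an assumption-laden illustration rather than the behavior of any particular flow-based forwarding element: the `FlowEntry` type, the action names (`set_field`, `write_metadata`, `resubmit`, `output`, `drop`), the use of a numeric priority to break ties, and the resubmit limit are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    match: dict        # field name -> required value (headers or metadata)
    actions: list      # list of (action_name, argument) tuples
    priority: int = 0  # assumed tie-breaker: highest priority wins


def lookup(table, fields):
    """Return the highest-priority entry whose match is satisfied, or None."""
    hits = [e for e in table
            if all(fields.get(k) == v for k, v in e.match.items())]
    return max(hits, key=lambda e: e.priority, default=None)


def process(table, packet, max_resubmits=8):
    """Run a packet (dict with "headers" and "metadata") through the table.

    Returns ("output", port) or ("drop", None).
    """
    for _ in range(max_resubmits):
        entry = lookup(table, {**packet["headers"], **packet["metadata"]})
        if entry is None:
            return ("drop", None)
        resubmit = False
        for name, arg in entry.actions:
            if name == "drop":
                return ("drop", None)
            if name == "output":
                return ("output", arg)
            if name == "set_field":          # modify the packet
                packet["headers"][arg[0]] = arg[1]
            elif name == "write_metadata":   # register-like scratch value
                packet["metadata"][arg[0]] = arg[1]
            elif name == "resubmit":         # send back for further processing
                resubmit = True
        if not resubmit:
            return ("drop", None)
    return ("drop", None)


if __name__ == "__main__":
    table = [
        FlowEntry({"dst_mac": "aa:bb:cc:00:00:02"},
                  [("write_metadata", ("reg0", 2)), ("resubmit", None)], priority=1),
        FlowEntry({"reg0": 2}, [("output", "lport-2")], priority=10),
    ]
    pkt = {"headers": {"dst_mac": "aa:bb:cc:00:00:02"}, "metadata": {}}
    print(process(table, pkt))
```

Here the first entry maps a destination MAC address to a logical port by writing metadata and resubmitting, and the second entry matches on that metadata to select the egress port.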

The tunneling data, in some embodiments, includes instructions on how to set up tunnels between the different forwarding elements (MFEs and DCN-MFEs). For instance, each of the DCN-MFEs 450-460 serves as a tunnel endpoint with a particular tunnel endpoint Internet Protocol (IP) address. Each DCN-MFE also receives addresses (e.g., tunnel endpoint IP addresses) of the other DCN-MFEs, as well as other information (e.g., logical network and logical port identifiers, etc.) to use when encapsulating packets using the tunnel protocol.

In the illustrated example of FIG. 4, each of the DCN-MFEs 450-460, after receiving the tunneling data from its associated local controller, sets up a tunnel between itself and each of the other two DCN-MFEs, as illustrated with highlighted double-headed arrows between these DCN-MFEs. This is because, based on the received configuration and forwarding data, each DCN-MFE knows that the VM in which it operates is connected to a logical network through the logical switch 415. As such, when an application in a particular VM generates an L2 packet that should be forwarded to another VM in the logical network, the DCN-MFE within the particular VM receives the L2 packet from the application and performs the L2 logical forwarding processing for the packet. After identifying the destination tunnel endpoint, the DCN-MFE encapsulates the packet with the destination tunnel endpoint data and forwards the packet towards the destination VM (e.g., through the PNICs of the host machines executing the source and destination VMs). Two examples of exchanging packets between two DCN-MFEs as tunnel endpoints using two different approaches (i.e., pass-through and emulation approaches) are described below by reference to FIGS. 5 and 6.
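
For the three DCN-MFEs of this example, a full mesh amounts to one tunnel per unordered pair of endpoints. The Python sketch below enumerates those pairs; the function name `full_mesh`, the labels used for the DCN-MFEs, and the tunnel endpoint IP addresses are placeholders chosen for illustration and are not taken from the figure.

```python
from itertools import combinations


def full_mesh(vteps):
    """Enumerate the tunnels of a full mesh.

    vteps: dict mapping a forwarding element name to its tunnel endpoint IP.
    Returns a list of ((name_a, ip_a), (name_b, ip_b)) pairs, one per tunnel.
    """
    return list(combinations(sorted(vteps.items()), 2))


if __name__ == "__main__":
    # Placeholder endpoint addresses for the DCN-MFEs of VM1-3.
    vteps = {
        "dcn-mfe-vm1": "10.0.1.1",
        "dcn-mfe-vm2": "10.0.2.1",
        "dcn-mfe-vm3": "10.0.3.1",
    }
    for a, b in full_mesh(vteps):
        print(f"{a[0]} ({a[1]}) <-> {b[0]} ({b[1]})")
```

With n endpoints the mesh contains n(n-1)/2 tunnels, which gives the three double-headed arrows of the example.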

As described above, in the pass-through approach of some embodiments, the source and destination DCNs operating on different host machines exchange data through the PNICs of the host machines without employing the hypervisors of the host machines. FIG. 5 illustrates an example of one such approach. More specifically, this figure shows two different DCNs of two different host machines communicating with each other directly through the physical network interface controllers (PNICs) of the host machines using a particular tunnel protocol (e.g., VXLAN protocol).

FIG. 5 shows two host machines 505 and 510. The first host machine 505 has a PNIC 515 and executes a hypervisor 525 along with two virtual machines 535 and 540. The second host machine 510 has a PNIC 520 and executes a hypervisor 530, as well as two virtual machines 545 and 550. In the illustrated example, the DCN 535 (VM1) has generated a packet 570 (e.g., a guest application running on the machine generated the packet) to be sent to the DCN 545 (e.g., to a guest application that runs in VM2). Although not shown, the two DCNs 535 and 545 are connected to a particular logical network (e.g., through a common logical switch, through two different logical switches that are connected to a common logical router, etc.). The logical network may or may not include the other two DCNs 540 and 550.

As shown in the figure, both of the DCN-MFEs 555 and 560 use the pass-through approach and communicate with each other directly through the PNICs 515 and 520 of the host machines 505 and 510, respectively. In other words, these two DCN-MFEs do not need the MFEs that run in the hypervisors 525 and 530 in order to exchange the network data with each other or with other network elements of the logical network to which the source and destination DCNs are connected. As will be described in more detail below by reference to FIGS. 11-13, though, when these DCN-MFEs communicate with other DCN-MFEs that operate on the same host machines, the DCN-MFEs use their respective hypervisors in order to exchange network traffic with the other DCN-MFEs.

In some embodiments, the source and destination DCN-MFEs 555 and 560 can communicate with the PNICs directly because the DCNs that contain these two DCN-MFEs also include the required physical drivers of the PNICs in order to communicate with the PNICs. In fact, in some embodiments, a DCN-MFE is a managed forwarding element that operates in the PNIC driver of a data compute node. In some other embodiments, the DCN-MFE is a separate driver that executes in the data compute node. As the DCNs 535 and 545 are connected to the same logical network (e.g., connected to a common logical L2 switch, to different L2 switches that are connected to a common L3 switch), both of the DCN-MFEs executing in these DCNs are capable of implementing the logical network. That is, the DCN-MFEs are capable of implementing the LFEs of the logical network based on the logical network forwarding and configuration data that the DCN-MFEs receive from their respective local controllers. The DCN-MFEs then perform the forwarding processing on the packets based on the data stored for the LFEs and data stored in the packets' headers.

When the DCN-MFE 555 receives the generated packet (e.g., from a source guest application running in the DCN 535), the DCN-MFE starts executing the forwarding pipelines of the logical forwarding elements that connect the source DCN-MFE to the destination DCN-MFE. That is, the DCN-MFE extracts the data stored in different packet headers (e.g., destination MAC address in the L2 header, destination IP address in the L3 header, etc.) and compares the extracted data with the forwarding data that the DCN-MFE receives from one or more central controllers of the CCP cluster.

Based on executing the forwarding pipelines, the source DCN-MFE 555 determines that the destination managed forwarding element is the DCN-MFE 560 (e.g., the DCN-MFE 560 implements a logical port of a logical switch that is associated with the destination DCN). After execution of the pipeline, the source DCN-MFE 555 establishes a tunnel 565 and uses the tunnel protocol to encapsulate the packet 570 with the IP address of the source DCN-MFE 555 as the source endpoint of the tunnel and the IP address of the destination DCN-MFE 560 as the destination endpoint of the tunnel. The source DCN-MFE 555 establishes the tunnel by inserting the source and destination IP addresses in the source and destination IP fields of the packet's outer headers. The packet is then forwarded towards the destination application in the destination DCN.

As shown in the figure, however, the physical path of the packet 570 is not through the illustrated tunnel 565, since this tunnel is only an abstract concept. That is, the tunnel encapsulation information injected into the outer headers of the packet causes the intermediary networking elements (e.g., PNICs 515 and 520, and other forwarding elements between the PNICs) to ignore the inner source and destination headers of the packet (as if a direct path between the two forwarding elements is created). In reality, the source DCN-MFE 555 sends the encapsulated packet to the PNIC 515, which in turn forwards the packet to the PNIC 520 based on the outer header information of the packet 570.

The PNIC 520 then forwards the packet directly to the destination DCN-MFE 560, which is the destination endpoint of the tunnel 565. The destination DCN-MFE 560, after receiving the packet 570, decapsulates the packet (i.e., removes the tunneling information added to the packet by the source DCN-MFE 555) using the same tunnel protocol used to encapsulate the packet, and sends the packet to the destination application inside the DCN 545 (e.g., based on the inner header information of the packet such as the destination port address stored in the L4 header of the packet).

In the above-described example, the source and destination VMs used the pass-through approach to exchange network data with each other. As described before, in the emulation approach, the source and destination VMs, although processing the packets inside the VMs, exchange the packets with each other through their respective hypervisors. FIG. 6 illustrates an example of one such approach. More specifically, this figure shows two different DCNs of two different host machines communicating with each other through the hypervisors (i.e., MFEs implemented in the hypervisors) of the host machines using a particular tunnel protocol (e.g., VXLAN protocol).

FIG. 6 includes elements similar to those in FIG. 5, except that in this figure the hypervisor 525 includes an MFE 660 and the hypervisor 530 includes an MFE 670. This figure shows that the first host machine 505 has a PNIC 515 and executes a hypervisor 525 along with two virtual machines 535 and 540. The second host machine 510 has a PNIC 520 and executes a hypervisor 530 along with two virtual machines 545 and 550. Also, in this figure the DCN 535 has generated a packet 570 to be sent to the DCN 545. Although not shown, the two DCNs 535 and 545 are connected to a logical network through a common logical switch. The logical network may or may not include the other two DCNs 540 and 550.

As shown in this figure, though, the DCN-MFEs 555 and 560 do not use the pass-through approach to communicate with each other directly through the PNICs 515 and 520 of the host machines 505 and 510 in order to exchange the packet 570. In other words, the source and destination DCN-MFEs in the example of this figure exchange packets using the emulation approach instead of the pass-through approach. There can be various reasons for selecting the emulation approach over the pass-through approach in some embodiments. For example, when a DCN does not include the necessary PNIC driver, or the PNIC driver running on the DCN is corrupt, the DCN uses the emulation approach to forward the packets. Alternatively, the DCN-MFEs might be configured (by the management control plane) to exchange packets of a particular type (e.g., belonging to a particular data flow, generated by a particular source application, etc.) only through the emulation approach.

Therefore, instead of communicating with the PNICs directly, the two DCN-MFEs 555 and 560 shown in FIG. 6 employ the MFEs 660 and 670 that operate in the hypervisors 525 and 530, respectively, in order to exchange network data with each other, or with other end machines of the logical network to which the DCN-MFEs 555 and 560 are connected. It is important to note that even though the source DCN-MFE 555 forwards the packet to the MFE 660 and the destination DCN-MFE 560 receives the packet from the MFE 670, the packet still has to be forwarded through the PNICs 515 and 520 in order to be transmitted from the source host machine to the destination host machine.

In the emulation approach of some embodiments, between the pair of source DCN-MFE and MFE in the source host machine, the forwarding element that is closer to the source of the packet performs the forwarding processing on the packet. Hence, since the source DCN-MFE is always closer to the source of the packet (i.e., the source application that generates the packet in the same DCN) in some embodiments, the source DCN-MFE performs the necessary forwarding processing (e.g., a logical switch's forwarding pipeline) for the packet. In some such embodiments, the source MFE only receives the packet from the source DCN-MFE and hands the packet to the corresponding PNIC of the source host machine without performing any forwarding processing for the packet.

Similarly, in some embodiments, between the pair of destination DCN-MFE and MFE in the destination host machine, the forwarding element that is closer to the source of the packet performs the forwarding processing on the packet. Therefore, in some embodiments, since the destination MFE is always closer to the source of the packet (i.e., the destination PNIC, which collects the packets from the physical network), the destination MFE performs the necessary forwarding processing for the packet. In some such embodiments, the destination DCN-MFE only receives the packet from the destination MFE and hands the packet to the corresponding destination application in the destination DCN without performing any forwarding processing.

In some embodiments, the closer forwarding element performs the forwarding processing in order to increase the efficiency of the forwarding process. For example, during the processing of a packet, when the first forwarding element on the path of a packet determines that the packet should be dropped (e.g., based on a network policy received from the control plane), the first forwarding element drops the packet and does not send the packet to the second forwarding element on the path to make such a determination. As such, extra network resources are not spent on forwarding the packet towards a destination that the packet is not supposed to reach.
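The efficiency argument can be illustrated with a short sketch. The `send_along_path` helper, the element names, and the example policy (dropping traffic to port 23) are all hypothetical and are assumptions of this illustration only; the point is that a packet dropped by the first element never reaches the second.

```python
from typing import Callable, List, Tuple

Policy = Callable[[dict], bool]  # returns True when the packet is allowed


def send_along_path(packet: dict, path: List[Tuple[str, Policy]]) -> Tuple[bool, List[str]]:
    """Walk the forwarding elements in order and stop at the first one that drops."""
    visited = []
    for name, allows in path:
        visited.append(name)
        if not allows(packet):
            return False, visited  # dropped here; later elements are never used
    return True, visited


def deny_port_23(packet: dict) -> bool:
    # Hypothetical network policy used only for this example.
    return packet["dst_port"] != 23


def allow_all(packet: dict) -> bool:
    return True


path = [("source DCN-MFE", deny_port_23), ("destination MFE", allow_all)]
dropped = send_along_path({"dst_port": 23}, path)
delivered = send_along_path({"dst_port": 443}, path)
```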

In the illustrated example, since the two DCNs 535 and 545 are connected to a common logical switch of the logical network (not shown), both of the DCN-MFEs executing in these DCNs are capable of performing the necessary forwarding pipeline of the common logical switch (e.g., based on the logical forwarding and configuration data that the DCN-MFEs have received from their respective local controllers). Additionally, both of the MFEs 660 and 670 receive the logical network configuration and forwarding data from their corresponding local controllers (not shown). Therefore, these two MFEs are also capable of performing the forwarding pipelines of the logical switches of the logical network.

When the DCN-MFE 555 receives the generated packet (e.g., from a source application running in the DCN 535), the DCN-MFE starts executing the forwarding pipeline of the logical switch. This forwarding pipeline indicates to the DCN-MFE 555 that the destination application is in the same subnet and connected to the same logical switch, which is being implemented by both the MFE 670 and the DCN-MFE 560 (e.g., based on the different source and destination data stored in the different packet headers).

After execution of the forwarding pipeline, the source DCN-MFE 555 establishes a tunnel 680 and uses the tunnel protocol to encapsulate the packet 570 with the IP address of the source DCN-MFE 555 as the source endpoint of the tunnel and the IP address of the destination MFE 670 as the destination endpoint of the tunnel. This is because, in the destination host machine, the destination MFE 670 is closer to the source of the packet (on the transmission path from the source application to the destination application) compared to the destination DCN-MFE 560. Consequently, the MFE 670 is selected as the destination endpoint of the tunnel over the DCN-MFE 560.
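The choice of tunnel destination endpoint in the two approaches can be summarized in a small sketch. The `Approach` enum, the `select_tunnel_destination` function, and the addresses are hypothetical names introduced for illustration; the sketch only restates the rule from the text that the destination DCN-MFE is the endpoint in the pass-through example and the hypervisor's MFE is the endpoint in the emulation example.

```python
from enum import Enum


class Approach(Enum):
    PASS_THROUGH = "pass-through"
    EMULATION = "emulation"


def select_tunnel_destination(approach: Approach, dcn_mfe_ip: str, hypervisor_mfe_ip: str) -> str:
    """Return the IP address to place in the outer destination header."""
    if approach is Approach.EMULATION:
        # The hypervisor's MFE sits closer to the PNIC, i.e., closer to the
        # source of the packet on the destination host.
        return hypervisor_mfe_ip
    # Pass-through: the destination DCN-MFE receives directly from the PNIC.
    return dcn_mfe_ip


# Addresses are assumptions made for this sketch.
emulation_endpoint = select_tunnel_destination(Approach.EMULATION, "192.168.0.20", "192.168.0.10")
pass_through_endpoint = select_tunnel_destination(Approach.PASS_THROUGH, "192.168.0.20", "192.168.0.10")
```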

As shown in the figure, however, the physical path of the packet 570 is not through the illustrated tunnel 680, since the illustrated tunnel is only an abstract concept. That is, the tunnel encapsulation information injected into the outer headers of the packet causes the intermediary networking elements (e.g., routers) to ignore the inner source and destination headers of the packet (as if a direct path between the two forwarding elements is created). In reality, the source DCN-MFE 555 sends the encapsulated packet to the MFE 660, which in turn forwards the packet to the PNIC 515.

The PNIC 515 then forwards the packet to the PNIC 520 based on the tunneling data encapsulated in the outer headers of the packet 570. The PNIC 520 then forwards the packet to the destination MFE 670, which is the destination endpoint of the tunnel 680. The MFE 670 then performs the forwarding processing for the packet (e.g., decapsulates the packet using the same tunnel protocol used for packet encapsulation) and forwards the packet based on the information stored in the inner headers of the packet (e.g., the MAC address of the destination DCN) to the destination DCN-MFE 560. The destination DCN-MFE 560, after receiving the packet 570, sends the packet to the destination application inside the DCN 545.

As described above, in some embodiments, the DCN-MFE of a data compute node enables the data compute node to perform network traffic forwarding processing in the DCN, instead of having the MFE of the virtualization software (e.g., the hypervisor) of the host machine perform the packet forwarding processing. In some embodiments that use the pass-through technology, the DCN-MFE performs the packet forwarding processing for both the outgoing and incoming network traffic.

In the pass-through approach of some embodiments, the data compute node includes and runs the required PNIC's driver in order to directly communicate with the PNIC of the host machine. In some embodiments, the DCN-MFE is a separate driver that runs in the DCN. In some other embodiments, the DCN-MFE is part of the PNIC driver running in the DCN. The DCN-MFE of the data compute node of some embodiments (regardless of being a part of the PNIC driver or being a separate driver in the DCN) is able to offload the processed network traffic directly to the PNIC of the host machine. Similarly, the DCN-MFE of the DCN is also able to receive the network traffic directly from the PNIC of the host machine. In other words, the DCN-MFE exchanges the network traffic directly with the PNIC of the host machine and without communicating with a managed forwarding element that operates in the hypervisor of the host machine.

In the pass-through approach of some embodiments, however, when a source DCN operating on a host machine needs to transmit the network traffic to a destination DCN that operates on the same host machine, the source and destination DCNs employ the hypervisor as an intermediary means for exchanging the network traffic. Employing the hypervisor of a host machine to exchange network traffic between the DCNs of the host machine is described in detail below by reference to FIGS. 11-13.

FIG. 7 illustrates an example of a source DCN-MFE within a data compute node that performs packet forwarding processing on a packet received from an application and transmits the processed packet directly to a PNIC of a host machine that hosts the data compute node. Specifically, this figure shows, through two stages 705 and 710, how a DCN-MFE of a virtual machine receives a packet originated from one of the guest applications running in the virtual machine, performs packet forwarding processing on the packet, and sends the packet to the physical NIC of the host machine to be forwarded towards the destination of the packet.

FIG. 7 shows a host machine 715 that includes a PNIC 720 and a hypervisor 725. The host machine can be a physical server or any other computer that hosts one or more data compute nodes (e.g., virtual machines, physical machines, containers, etc.) for a tenant of a hosting system or a datacenter. The host machine also hosts a virtual machine 730. The virtual machine 730 runs a DCN-MFE 740 in the kernel space 760 (e.g., in the network stack in the kernel) of the virtual machine (VM) and two applications 750 and 755 in the user space of the VM.

In the first stage 705, the source application 750 has generated a packet 780 to be sent to a destination application executing in a destination data compute node that operates on a destination host machine (not shown). Since the DCN-MFE 740 executes all of the forwarding pipelines of the network to which the VM 730 is connected, all of the applications that run in this virtual machine 730 send and receive their corresponding network traffic to and from the DCN-MFE 740.

Therefore, the application 750 sends the packet 780 (e.g., through a communication channel instantiated in the virtual machine for such communications) to the DCN-MFE 740. When the DCN-MFE 740 receives the packet, the DCN-MFE determines whether the packet is destined for an end machine of the same logical network to which the VM 730 is connected or is destined for a different logical and/or physical network.

In some embodiments, if the packet is determined to belong to a different network (e.g., based on destination information in different headers of the packet), the DCN-MFE 740 forwards the packet to the managed forwarding element (MFE) 770 that operates in the hypervisor of the host machine for further forwarding processing of the packet. That is, since the MFE implements all the logical forwarding elements of the different logical networks, even if the packet is destined for a different logical or physical network, the MFE will have the necessary forwarding pipeline to determine the next destination of the packet.

It should be understood that sending the packet to an MFE to perform forwarding processing on the packet in this manner is not the same as processing the packet inside the virtual machine and using the MFE merely as an intermediary to pass the packet to the PNIC (as defined in the emulation approach). In other words, when the DCN-MFE of some embodiments determines that the destination of a packet belongs to a different network, the DCN-MFE hands the packet to the MFE to perform the whole forwarding processing on the packet. This is different from the emulation approach, in which the DCN-MFE performs all the necessary forwarding processing on the packet and then merely passes the packet to the MFE to be handed to the PNIC without any further forwarding processing.
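The distinction drawn above can be sketched as a simple classification step. In this illustration, membership in the logical network is approximated by a list of subnets; that approximation, the `Handling` enum, the `classify` function, and the addresses are all assumptions of the sketch rather than details taken from any embodiment.

```python
import ipaddress
from enum import Enum
from typing import List


class Handling(Enum):
    PROCESS_IN_DCN_MFE = "DCN-MFE runs the forwarding pipelines itself"
    DELEGATE_TO_MFE = "hypervisor MFE performs the whole forwarding processing"


def classify(dst_ip: str, logical_network_subnets: List[str]) -> Handling:
    """Decide which forwarding element processes an outgoing packet."""
    addr = ipaddress.ip_address(dst_ip)
    for subnet in logical_network_subnets:
        if addr in ipaddress.ip_network(subnet):
            return Handling.PROCESS_IN_DCN_MFE
    # Destination lies in a different logical and/or physical network.
    return Handling.DELEGATE_TO_MFE


# Hypothetical subnets of the logical network to which the VM is connected.
subnets = ["10.0.1.0/24", "10.0.2.0/24"]
same_network = classify("10.0.2.7", subnets)
other_network = classify("203.0.113.9", subnets)
```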

When the DCN-MFE 740 determines that the packet is destined for another data compute node that belongs to the same logical network but on a different host machine, the DCN-MFE executes all the necessary forwarding pipelines to determine which other managed forwarding element (i.e., other MFE or DCN-MFE) implements the logical switch to which the destination data compute node couples. As an example, the source DCN-MFE could be coupled to a first L2 logical switch, while the destination DCN-MFE is coupled to a second, different logical switch. However, both of the first and second logical switches are connected to each other through a logical router.

As such, the DCN-MFE 740 executes the forwarding pipelines of all three logical forwarding elements (i.e., the two L2 logical switches and the L3 logical router) to determine that the destination DCN is connected to the second L2 logical switch. The DCN-MFE 740 then encapsulates the packet with tunneling information, in which the IP address of the source DCN-MFE 740 is the source endpoint address of the tunnel and the IP address of the destination MFE or DCN-MFE that implements the second L2 switch is the destination endpoint address of the tunnel.
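A minimal sketch of chaining the first logical switch, the logical router, and the second logical switch is given below. The tables `SWITCH_SUBNETS` and `PORT_LOCATIONS`, the switch names, and the addresses are hypothetical stand-ins for the forwarding data a DCN-MFE would receive from its controller; real pipelines would also match on MAC addresses and other header fields.

```python
import ipaddress
from typing import Tuple

# Hypothetical forwarding data (assumed for this sketch).
SWITCH_SUBNETS = {"LS1": "10.0.1.0/24", "LS2": "10.0.2.0/24"}
# Per logical switch: destination IP -> tunnel endpoint IP of the managed
# forwarding element that implements the corresponding logical port.
PORT_LOCATIONS = {
    "LS1": {"10.0.1.5": "192.168.0.1"},
    "LS2": {"10.0.2.7": "192.168.0.2"},
}


def first_switch_pipeline(src_switch: str, dst_ip: str) -> str:
    """Stay on the source switch if the destination is on its subnet, else go to the router."""
    subnet = ipaddress.ip_network(SWITCH_SUBNETS[src_switch])
    return src_switch if ipaddress.ip_address(dst_ip) in subnet else "ROUTER"


def router_pipeline(dst_ip: str) -> str:
    """Pick the logical switch whose subnet contains the destination."""
    addr = ipaddress.ip_address(dst_ip)
    for switch, subnet in SWITCH_SUBNETS.items():
        if addr in ipaddress.ip_network(subnet):
            return switch
    raise LookupError(f"no route for {dst_ip}")


def resolve(src_switch: str, dst_ip: str) -> Tuple[str, str]:
    """Return (destination logical switch, tunnel endpoint IP of its implementing MFE)."""
    hop = first_switch_pipeline(src_switch, dst_ip)
    dst_switch = router_pipeline(dst_ip) if hop == "ROUTER" else hop
    return dst_switch, PORT_LOCATIONS[dst_switch][dst_ip]


routed = resolve("LS1", "10.0.2.7")
switched = resolve("LS1", "10.0.1.5")
```

The second element of the returned tuple is the address that would be used as the destination endpoint of the tunnel.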

The second stage 710 shows that after the packet 780 is processed by the DCN-MFE 740, the packet is sent to the PNIC 720 to be sent to the destination PNIC, which is connected to the destination MFE or DCN-MFE that implements the logical switch to which the destination DCN is connected. As shown in this stage, the packet is sent directly to the PNIC 720 without being processed by the hypervisor 725 or the MFE 770 that operates in the hypervisor 725. As described above, this is possible in the pass-through approach because the virtual machine 730 includes the PNIC driver of the PNIC 720 and as such can communicate with the PNIC 720 directly instead of sending the packet to the MFE to do so (as in the emulation approach). The second stage also shows that the tunneling information 790 has been added to the packet (e.g., stored in the outer header of the packet) by the DCN-MFE.

FIG. 8 illustrates an example of a destination DCN-MFE within a data compute node that receives a packet directly from a PNIC of a host machine that hosts the data compute node, and performs the necessary packet forwarding processing on the received packet. Specifically, this figure shows, through two stages 805 and 810, how a DCN-MFE of a virtual machine (i) receives a packet originated from a source MFE or DCN-MFE that operates in a different host machine, (ii) performs packet forwarding processing on the received packet, and (iii) sends the packet to a destination application that runs on the same virtual machine on which the DCN-MFE runs.

Similar to FIG. 7, this figure shows a host machine 715 that includes a PNIC 720 and a hypervisor 725. The host machine can be a physical server or any other computer that hosts one or more data compute nodes (e.g., virtual machines, physical machines, containers, etc.) for a tenant of a hosting system or a datacenter. The host machine also hosts a virtual machine 730. The virtual machine 730 runs a DCN-MFE 740 in the kernel space 760 (e.g., in the network stack in the kernel) of the VM and two applications 750 and 755 in the user space of the VM.

In the first stage 805, the DCN-MFE 740 receives a packet 830 from the PNIC 720 of the host machine that is destined for one of the two applications 750 and 755 that run in the virtual machine 730. As shown in the first stage, the packet is received directly from the PNIC 720 without being processed by the hypervisor 725 or the MFE 770 that operates in the hypervisor 725. As described above, this is possible in the pass-through approach because the virtual machine 730 includes the PNIC driver of the PNIC 720 and as such can communicate with the PNIC 720 directly instead of having the PNIC send the packet to the MFE and receiving the packet from the MFE.

The first stage also shows that the packet is already encapsulated with the tunneling information 820 (e.g., stored in the outer header of the packet 830) by a source DCN-MFE that has performed the forwarding processing on the packet in a source machine (e.g., a source virtual machine). The tunneling information 820 shows that the DCN-MFE 740 is the destination endpoint of the tunnel (e.g., the destination IP address in the outer header of the packet has the tunnel endpoint IP address of the DCN-MFE 740). The tunneling information also includes other information such as the tunnel endpoint IP address of the source DCN-MFE that has received the packet from the source application.

In other words, when the source DCN-MFE has received the packet from the source application in the source machine, the source DCN-MFE has determined that the packet is destined for an application inside the VM 730, which belongs to the same logical network to which the source VM belongs, and as such has encapsulated the packet with the DCN-MFE 740 as the destination endpoint of the tunnel. The source DCN-MFE has done so by executing all the necessary forwarding pipelines of the logical forwarding elements (e.g., logical L2 and L3 switches) that logically connect the source and destination VMs to each other.

The second stage 810 shows that, after the encapsulated packet 830 is received from the PNIC 720 (e.g., based on the outer header information of the packet 830), the destination DCN-MFE 740 decapsulates the packet (i.e., removes the tunneling information added to the packet by the source DCN-MFE) using the same tunnel protocol that has been used by the source DCN-MFE to encapsulate the packet. The destination DCN-MFE 740 then sends the packet 830 to the destination application 755 inside the VM 730. The destination DCN-MFE 740 sends the packet to the destination application based on the inner header destination information of the packet (e.g., the destination port address stored in the inner L4 header of the packet that is associated with the destination application 755).
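The receive-side handling described for the second stage can be sketched as decapsulation followed by a lookup on the inner L4 destination port. The dictionary-based packet layout, the `decapsulate` and `deliver` helpers, the port numbers, and the application labels are hypothetical and serve only to illustrate the two steps.

```python
from typing import Dict, Tuple


def decapsulate(tunneled: dict) -> dict:
    """Remove the outer (tunnel) headers and return the inner packet."""
    return tunneled["inner"]


def deliver(tunneled: dict, listeners: Dict[int, str]) -> Tuple[str, bytes]:
    """Decapsulate, then pick the destination application by inner L4 port."""
    inner = decapsulate(tunneled)
    app = listeners[inner["dst_port"]]
    return app, inner["payload"]


# Hypothetical port-to-application bindings inside the destination VM.
listeners = {8080: "application 750", 9090: "application 755"}
tunneled = {
    "outer_src_ip": "192.168.0.1",  # source tunnel endpoint (assumed value)
    "outer_dst_ip": "192.168.0.2",  # destination tunnel endpoint (assumed value)
    "inner": {"dst_port": 9090, "payload": b"data"},
}
result = deliver(tunneled, listeners)
```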

Unlike the pass-through approach, the DCN-MFE of some embodiments sends the packets (e.g., through a virtual network interface controller (VNIC) of the DCN) to the hypervisor of the host machine in the emulation approach. That is, even though the forwarding processing of the network traffic is done by the DCN-MFE, the processed network traffic is still sent to the virtualization software rather than the PNIC of the host machine. In some such embodiments, the MFE operating in the virtualization software does not perform any additional forwarding processing on the outgoing packets and merely hands the received packets to the PNIC of the host machine. In some embodiments, however, the MFE operating in the virtualization software of the host machine performs packet forwarding processing for the incoming network traffic that is destined for a DCN-MFE of a DCN operating on the host machine. In other words, in some embodiments, the outgoing traffic is processed by the DCN-MFEs of the DCNs executing on a host machine, while the incoming traffic is processed by the MFE operating in the virtualization software of the host machine.

The reason for having the DCN-MFE process the outgoing traffic and the hypervisor's MFE process the incoming traffic is that some embodiments have the forwarding element that is closer to the source of the packets perform the packet processing in order to increase the network traffic efficiency. For example, during the processing of a packet, when the first forwarding element on the path of a packet determines that the packet should be dropped (based on a network policy), the first forwarding element drops the packet and does not send the packet to the second forwarding element on the path to make such a determination. As such, extra network resources are not spent on forwarding the packet towards a destination that the packet is not supposed to reach.

FIGS. 9 and 10 illustrate how the MFE of a virtualization software executing in a host machine handles the incoming and outgoing traffic in the emulation approach. More specifically, FIG. 9 illustrates an MFE residing in the hypervisor of a host machine handling the outgoing traffic in the emulation approach. This figure shows, through two stages 905 and 910, how the MFE that operates in the hypervisor of a host machine receives a packet from a source VM and, without performing any additional forwarding processing, delivers the packet to the PNIC of the host machine.

As described before, the reason for a DCN-MFE to choose the emulation approach over the pass-through approach (i.e., to send the network traffic to the PNIC through the MFE of the hypervisor instead of sending the packets directly to the PNIC) could be different in different embodiments. For example, when a DCN does not include the necessary PNIC driver, or the PNIC driver running on the DCN is corrupt, the DCN uses the emulation approach to forward the packets through the MFE running in the hypervisor. Alternatively, the DCN-MFEs might be configured (by the management control plane) to exchange packets of a particular type (e.g., belonging to a particular data flow, generated by a particular source application, etc.) only through the MFEs that run on the hypervisors of their respective host machines.

FIG. 9 shows a host machine 915 that includes a PNIC 920 and a hypervisor 930. The hypervisor 930 executes a managed forwarding element 940 which performs different network functionalities for the virtual machines executing in the host machine. In the shown figure, the host machine 915 hosts two virtual machines 950 and 955. The virtual machine 950 runs a DCN-MFE 960 (e.g., in the kernel of the virtual machine VM1). The virtual machine 955 runs a DCN-MFE 965 (e.g., in the kernel of the virtual machine VM2). Each of these virtual machines also executes several different applications (not shown) in the user space of the virtual machine.

In the first stage 905, a source application in the virtual machine 950 has generated a packet 970 to be sent to a destination application executing in a destination data compute node that operates on a different host machine (not shown). Since the DCN-MFE 960 executes all of the forwarding pipelines of the network to which the VM 950 is connected, all of the applications that run in this virtual machine, including the source application of the packet 970, send and receive their corresponding network traffic to and from the DCN-MFE 960.

Therefore, the source application sends the packet 970 (e.g., through a communication channel instantiated in the virtual machine for such communications) to the DCN-MFE 960. When the DCN-MFE 960 receives the packet, the DCN-MFE determines whether the packet is destined for a DCN that is connected to the same logical network to which the VM 950 is connected or is destined for a different logical and/or physical network.

In some embodiments, if the packet is determined to belong to a different network (e.g., based on destination information in different headers of the packet), the DCN-MFE 960 forwards the packet to the MFE 940 that operates in the hypervisor of the host machine for further forwarding processing of the packet. That is, since the MFE 940 implements all the logical forwarding elements of the different logical networks, even if the packet is destined for a different logical or physical network, the MFE will have the necessary forwarding pipeline to determine the next destination of the packet.

It should be understood that sending the packet to an MFE to perform forwarding processing on the packet in this manner is not the same as processing the packet inside the virtual machine and using the MFE merely as an intermediary to pass the packet to the PNIC (as shown in this figure). In other words, when the DCN-MFE of some embodiments determines that the destination of a packet belongs to a different network, the DCN-MFE hands the packet to the MFE to perform the whole forwarding processing on the packet. This is different from the emulation approach shown in this figure, in which the DCN-MFE performs all the necessary forwarding processing on the packet and then merely passes the packet to the MFE to be handed to the PNIC without any further forwarding processing.

When the DCN-MFE 960 determines that the packet is destined for another data compute node that belongs to the same logical network but on a different host machine, the DCN-MFE executes all the necessary forwarding pipelines to determine which other managed forwarding element (i.e., other MFE or DCN-MFE) implements the logical switch to which the destination data compute node couples. As an example, the source DCN-MFE could be coupled to a first L2 logical switch, while the destination DCN-MFE is coupled to a second, different logical switch. However, both of the first and second logical switches are connected to each other through a logical router.

As such, the DCN-MFE 960 executes the forwarding pipelines of all three logical forwarding elements (i.e., the two L2 logical switches and the L3 logical router) to determine that the destination DCN is connected to the second L2 logical switch. The DCN-MFE 960 then encapsulates the packet with tunneling information, in which the IP address of the source DCN-MFE 960 is the source endpoint address of the tunnel and the IP address of the destination MFE or DCN-MFE that implements the second logical L2 switch is the destination endpoint address of the tunnel.

The first stage 905 also shows that after the packet 970 is processed by the DCN-MFE 960, the packet is sent to the MFE 940 operating in the hypervisor 930 to be sent to the source PNIC 920. Additionally, the first stage shows that the tunneling information 980 has been added to the packet 970 (e.g., stored in the outer header of the packet) by the DCN-MFE 960 before sending the packet to the MFE 940.

The second stage 910 shows that the same packet 970 with the same tunneling information 980 is transmitted from the MFE 940 towards the PNIC 920. That is, even though the MFE 940 is primarily for performing network functionalities on the network traffic data, in this particular case, the MFE 940 only plays the role of a messenger that receives the packet from the DCN-MFE 960 and delivers the packet to the PNIC 920. This is because the DCN-MFE 960 has already performed all of the required forwarding processing and even encapsulated the packet with the tunnel endpoint addresses.

FIG. 10 illustrates an MFE residing in the virtualization software of a host machine that handles the incoming traffic in the emulation approach. Specifically, this figure shows, through two stages 1005 and 1010, how an MFE operating in the hypervisor of a host machine (i) receives a packet originated from a source DCN-MFE that operates in a different host machine, (ii) performs packet forwarding processing on the received packet, and (iii) sends the packet to a destination DCN-MFE in the emulation approach.

As described above, in some embodiments, the MFE operating in the virtualization software of the host machine performs packet forwarding processing for the incoming network traffic that is destined for a DCN-MFE of a DCN operating on the host machine. The reason for having the DCN-MFE process the outgoing traffic and the hypervisor's MFE process the incoming traffic is that some embodiments have the forwarding element that is closer to the source of the packets perform the packet processing in order to increase the network traffic efficiency.

Similar to FIG. 9, this figure shows a host machine 915 that includes a PNIC 920 and a hypervisor 930. The hypervisor 930 executes a managed forwarding element 940 which performs different network functionalities for the virtual machines executing in the host machine. In the shown figure, the host machine 915 hosts two virtual machines 950 and 955. The virtual machine 950 runs a DCN-MFE 960 (e.g., in the kernel of the virtual machine VM1). The virtual machine 955 runs a DCN-MFE 965 (e.g., in the kernel of the virtual machine VM2). Each of these virtual machines also executes several different applications (not shown) in the user space of the virtual machine.

In the first stage 1005, the MFE 940 receives a packet 1030 from the PNIC 920 of the host machine that is destined for one of the two virtual machines 950 and 955 that run in the host machine 915. This stage also shows that the packet is encapsulated with the tunneling information 1020 (e.g., stored in the outer header of the packet 1030) by a source DCN-MFE that has performed the forwarding processing on the packet in a source machine (e.g., a source virtual machine). The tunneling information 1020 shows that the MFE 940 is the destination endpoint of the tunnel (e.g., the destination IP address in the outer header of the packet has the tunnel endpoint IP address of the MFE 940). The tunneling information also includes other information such as the tunnel endpoint IP address of the source DCN-MFE that has received the packet from the source application (in a different host machine).

In other words, when the source DCN-MFE has received the packet from the source application in the source machine, the source DCN-MFE has determined that the packet is destined for a virtual machine executing in the host machine 915. Since the source DCN-MFE uses the emulation approach, the source DCN-MFE has encapsulated the packet with the MFE 940 as the destination endpoint of the tunnel instead of the DCN-MFE 960. The source DCN-MFE has done so by executing all the necessary forwarding pipelines of logical forwarding elements (e.g., logical L2 and L3 switches) that logically connect the source and destination VMs to each other.

The second stage 1010 shows that, after the encapsulated packet 1030 is received from the PNIC 920 (e.g., based on the outer header information of the packet 1030), the destination MFE 940 decapsulates the packet (i.e., removes the tunneling information added to the packet by the source DCN-MFE) using the same tunnel protocol that has been used by the source DCN-MFE to encapsulate the packet. The destination MFE 940 then sends the packet 1030 to the destination DCN-MFE 960 (e.g., based on the destination MAC address in the inner L2 header of the packet).

The destination DCN-MFE 960 subsequently sends the packet 1030 to a destination application (not shown) running inside the VM 950. The destination DCN-MFE 960 sends the packet to the destination application based on the inner header destination information of the packet (e.g., the destination port address stored in the inner L4 header of the packet that is associated with the destination application).
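As an illustration only, the receive path described above for the emulation approach can be sketched as two lookups: the hypervisor MFE strips the outer header and selects a DCN-MFE by the inner destination MAC address, and the DCN-MFE then selects an application by the inner L4 destination port. The class names, field names, and lookup tables below are hypothetical and are not drawn from the figures; a real tunnel protocol carries considerably more state.

```python
from dataclasses import dataclass


@dataclass
class InnerPacket:
    dst_mac: str    # destination MAC address in the inner L2 header
    dst_port: int   # destination port in the inner L4 header
    payload: bytes


@dataclass
class TunnelPacket:
    src_vtep: str   # tunnel endpoint IP of the source DCN-MFE (outer header)
    dst_vtep: str   # tunnel endpoint IP of the destination MFE (outer header)
    inner: InnerPacket


def mfe_receive(pkt, my_vtep, mac_to_dcn_mfe):
    """Hypervisor MFE: decapsulate, then pick a DCN-MFE by inner MAC.

    Returns (dcn_mfe, inner_packet), or (None, None) when the outer
    header is not addressed to this tunnel endpoint.
    """
    if pkt.dst_vtep != my_vtep:
        return None, None
    inner = pkt.inner  # discarding the outer header models decapsulation
    return mac_to_dcn_mfe.get(inner.dst_mac), inner


def dcn_mfe_deliver(inner, port_to_app):
    """DCN-MFE: pick the destination application by inner L4 port."""
    return port_to_app.get(inner.dst_port)
```

For example, a packet whose outer destination is the MFE's tunnel endpoint and whose inner MAC address belongs to VM1 would be handed to VM1's DCN-MFE, which would then consult its own port table.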

In the pass-through approach of some embodiments, the DCN-MFE of the source DCN uses the virtualization software of the host machine to transmit the packet to a destination DCN after the source DCN-MFE realizes that the destination DCN operates on the same host machine as the source DCN. That is, when both the source and destination DCNs operate on the same host machine, the source DCN-MFE offloads the packets destined for the destination DCN to a memory space of the host machine that is controlled by the virtualization software (e.g., hypervisor) and that is shared with the destination DCN. After storing the packets in the shared memory space, the source DCN-MFE notifies the hypervisor of the offload. In some embodiments, after receiving the offload notification, the hypervisor notifies the destination DCN-MFE about the new network traffic (e.g., data packets) that is stored in the shared memory space. The destination DCN-MFE of some such embodiments reads the packets from the shared memory space upon receiving the notification from the hypervisor.

In some embodiments, the shared memory space includes one or more particular physical pages of a host machine's physical memory that the hypervisor of the host machine assigns as a shared memory space for the DCNs operating on the host machine. In some such embodiments, the hypervisor of the host machine assigns the physical page(s) as shared memory space between the DCNs by mapping the physical page(s) to one or more particular physical pages in each DCN that shares the memory space. In this manner, the same physical pages of the host machine's memory become available to two or more DCNs operating on the host machine for writing to and reading from these shared physical pages.

FIG. 11 conceptually illustrates a process 1100 of some embodiments for employing the virtualization software (e.g., hypervisor) of a host machine in order to exchange network data between two data compute nodes of the host machine in the pass-through approach. In some embodiments, process 1100 is performed by a source DCN-MFE that runs inside a source data compute node (i.e., a DCN that contains the source application that generates the packets) operating on the host machine. The process 1100 will be described by reference to FIG. 12, which provides an example of two different ways of forwarding data from a source DCN-MFE based on the destination DCN-MFE being on the same host machine or a different host machine.

As shown in FIG. 11, the process 1100 begins by receiving (at 1110) a packet from one of the applications that execute in the virtual machine that runs the DCN-MFE. As described above, since the DCN-MFE is the forwarding element that performs the different network functionalities for the network to which the virtual machine is connected, all of the guest applications that run on the virtual machine exchange their incoming and/or outgoing packets with the DCN-MFE.

After receiving the packet (from a source application), the process identifies (at 1120) the destination path of the packet (e.g., to which logical switch the destination DCN is connected) by executing the necessary forwarding pipelines for the packet. For example, if the source and destination DCNs are connected to a particular L2 logical switch, the process determines whether the particular L2 switch is implemented by a DCN-MFE that operates on the same host machine, or by a DCN-MFE that operates on a different host machine.

The process of some embodiments makes such a determination by extracting source and destination data from the different packet headers (e.g., source and destination addresses extracted from the L2 and L3 packet headers) and comparing this data with the forwarding data the process receives from the CCP cluster (e.g., from the local controller operating on the host machine).

On the other hand, when the source DCN-MFE is coupled to a first L2 logical switch and the destination DCN-MFE is coupled to a second, different logical switch, but the first and second logical switches are connected to each other through a logical router, the process runs the forwarding pipelines of all three LFEs. The process can then determine whether the second logical L2 switch is implemented by a DCN-MFE that operates on the same host machine (i.e., on a DCN that is hosted by the same host machine) or not.
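The two cases described above (a single logical switch, or two logical switches joined by a logical router) might be sketched as follows. This is a simplified sketch under stated assumptions: the table layouts, the `run_pipelines` function, and the use of the destination DCN's MAC address as the lookup key on both switches are hypothetical, and a real L3 hop would rewrite MAC addresses rather than reuse one key throughout.

```python
import ipaddress


def run_pipelines(src_switch, dst_ip, dst_mac, switches, router):
    """Return (pipelines executed, port binding of the destination DCN).

    switches: {switch name: {"ports": {mac: {"port", "mfe", "host"}}}}
    router:   {subnet in CIDR form: name of the attached logical switch}
    """
    pipeline = [src_switch]
    sw = switches[src_switch]
    if dst_mac not in sw["ports"]:
        # The destination is not on the source switch, so the logical
        # router pipeline runs next, followed by the second switch.
        pipeline.append("router")
        addr = ipaddress.ip_address(dst_ip)
        next_switch = next(
            (name for subnet, name in router.items()
             if addr in ipaddress.ip_network(subnet)),
            None,
        )
        if next_switch is None:
            return pipeline, None  # no route to the destination subnet
        pipeline.append(next_switch)
        sw = switches[next_switch]
    return pipeline, sw["ports"].get(dst_mac)
```

The returned binding names the forwarding element and host that implement the destination logical port, which is the information the process needs for the same-host determination.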

Based on the identification of the destination path, the process determines (at 1130) whether the destination DCN is on the same host machine or not. When the process determines that the source and destination DCNs are not on the same host machine, the process forwards (at 1140) the received packet to either the PNIC or the MFE running in the hypervisor of the host machine. That is, depending on whether the process uses the emulation approach or the pass-through approach, the packet could be sent to the MFE or the PNIC of the host machine, respectively. The process then ends.

FIG. 12 shows, through three different stages 1205-1215, a source DCN-MFE sending a first packet to the PNIC of the host machine and a second packet to the hypervisor of the host machine. This figure includes a host machine 1220 that includes a PNIC 1225 and a hypervisor 1230. The host machine 1220 hosts two virtual machines 1250 and 1255. The virtual machine 1250 runs a DCN-MFE 1260 (e.g., in the kernel of the virtual machine VM1). The virtual machine 1255 runs a DCN-MFE 1265 (e.g., in the kernel of the virtual machine VM2). Each of these virtual machines also executes several different applications (not shown) in the user space of the virtual machine.

In the first stage 1205, a source application in the virtual machine 1250 has generated a packet to be sent to a destination application executing in a destination data compute node that operates on a different host machine (not shown). The source application sends the packet (e.g., through a communication channel instantiated in the virtual machine for such communications) to the DCN-MFE 1260. When the DCN-MFE 1260 receives the packet, the DCN-MFE determines whether the packet is destined for an end machine in the same logical network to which the VM 1250 is connected or for a different logical and/or physical network.

When the DCN-MFE 1260 determines that the packet is destined for another data compute node that belongs to the same logical network but is on a different host machine, the DCN-MFE executes all the necessary forwarding pipelines to determine which other managed forwarding element (i.e., other MFE or DCN-MFE) implements the logical switch to which the destination data compute node couples. That is, the DCN-MFE 1260 extracts data that is stored in the different packet headers and compares the extracted data with the forwarding data that the DCN-MFE receives from the control plane. By comparing the data, the DCN-MFE 1260 identifies that another MFE or DCN-MFE that operates on another host machine implements the logical switch to which the destination DCN is connected. As such, the DCN-MFE 1260 encapsulates the packet with the necessary tunneling information (e.g., destination tunnel endpoint address) and forwards the processed packet 1270 to the PNIC 1225.

Returning to FIG. 11, when the process 1100 determines (at 1130) that the source and destination DCNs are on the same host machine, the process forwards (at 1150) the received packet to the hypervisor running in the host machine that hosts the source DCN. In some embodiments, as will be described below, the process stores the packet in a shared memory space of the host machine that the hypervisor controls (e.g., in a physical page of the source VM that is mapped to a physical page of the host machine). The destination DCN is then notified about the arrival of the new packet in the shared memory space and the control of the shared memory space is passed from the source DCN to the destination DCN. The destination DCN, upon receiving the notification, reads the new packet from the shared memory space (e.g., from a physical page of the destination VM that is mapped to the same physical page of the host machine). The process then ends.
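Operations 1130-1150 of the process 1100 reduce to a single decision, which may be sketched as follows. The enumeration, the function name, and the string labels for the three outcomes are hypothetical; the sketch follows the description literally in that a same-host destination is always handed to the hypervisor, while a remote destination goes to the MFE or the PNIC depending on the approach in use.

```python
from enum import Enum


class Approach(Enum):
    EMULATION = "emulation"
    PASS_THROUGH = "pass-through"


def next_hop(dest_host, local_host, approach):
    """Pick where the source DCN-MFE hands off a processed packet."""
    if dest_host == local_host:
        # Operation 1150: the packet is stored in hypervisor-controlled
        # shared memory and no tunnel encapsulation is applied.
        return "hypervisor shared memory"
    if approach is Approach.EMULATION:
        # Operation 1140, emulation approach.
        return "hypervisor MFE"
    # Operation 1140, pass-through approach.
    return "PNIC"
```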

The second stage 1210 of FIG. 12 shows that the DCN-MFE 1260 has processed a second packet 1280 and identified the destination DCN of the packet to be on the same host machine as the source DCN. As such, the DCN-MFE 1260 simply stores the packet in a shared memory space 1240 and does not encapsulate the packet with any tunneling information. That is, the DCN-MFE 1260 writes the packet to a guest physical page that is defined for the source DCN and that is mapped to a host physical page controlled by the hypervisor of the host machine.

The third stage 1215 shows that the destination DCN-MFE 1265 has been notified of the arrival of the new packet 1280 in the shared memory space and, as such, the destination DCN-MFE 1265 reads the packet 1280 from the shared memory space 1240. After the destination DCN-MFE 1265 receives the packet, the destination DCN-MFE forwards the packet to the destination application based on the information in the destination packet headers (e.g., the destination port address in the L4 packet header).

FIG. 13 illustrates a more detailed example of exchanging network data between a source DCN-MFE and a destination DCN-MFE in a pass-through approach, when the virtual machines containing the source and destination DCN-MFEs operate on the same host machine. This figure shows, through four different stages 1305-1320, a source DCN-MFE storing a packet in a shared memory space of a hypervisor of the host machine for a destination DCN-MFE to read the packet from the memory space.

FIG. 13 includes a host machine 1325 that includes a hypervisor 1330 and a physical memory 1345. The host machine 1325 hosts two virtual machines 1350 and 1355. The virtual machine 1350 runs a DCN-MFE 1360 (e.g., in the kernel of the virtual machine VM1). The virtual machine 1355 runs a DCN-MFE 1365 (e.g., in the kernel of the virtual machine VM2). Each of these virtual machines also executes several different applications (not shown) in the user space of the virtual machine. The physical memory 1345 includes a particular memory space (physical page) 1340 that the hypervisor of some embodiments assigns as a shared memory space for the virtual machines operating on the host machine by mapping this physical page 1340 to a particular physical page in each virtual machine (not shown).

The first stage 1305 shows that the DCN-MFE 1360 has processed a packet 1370 and identified that the destination DCN of the packet is on the same host machine as the source DCN. As described above, the DCN-MFE 1360 of some embodiments performs the forwarding processing on the packet 1370 by extracting the source and destination information in different layer headers (e.g., L2 header, L3 header, etc.) of the packet and comparing the extracted information with the forwarding information the DCN-MFE 1360 receives from one or more central controllers of the CCP cluster.

The DCN-MFE 1360, after performing the forwarding processing on the packet, determines whether the logical switch port to which the destination DCN is connected is implemented by a destination DCN-MFE in the same host machine or in a different host machine. For example, when both VM1 and VM2 shown in this figure are connected to the same logical L2 switch, by performing the forwarding processing, the DCN-MFE 1360 identifies the logical port of the logical L2 switch to which the destination VM2 is connected (e.g., by looking at the destination MAC address in the L2 header of the packet).

Furthermore, when the DCN-MFE 1360 performs the forwarding pipeline of the logical L2 switch, based on the forwarding tables configured in the DCN-MFE 1360 (by the control plane), the DCN-MFE 1360 realizes that the DCN-MFE 1365 is the forwarding element that implements the port of the logical L2 switch that is connected to the destination VM 1355. Therefore, the source DCN-MFE 1360 does not use the direct communication channel to the PNIC of the host machine and instead uses the memory space shared with the destination DCN-MFE 1365 to send the packet to this destination DCN-MFE.

Because the source and destination DCNs operate on the same host machine, the DCN-MFE 1360 identifies a memory space that the source DCN-MFE 1360 and the destination DCN-MFE 1365 share for reading and writing the packets exchanged between the two forwarding elements. Although the first stage 1305 shows that the packet 1370 is being sent from the DCN 1350 to the physical memory 1345, in reality, the source DCN-MFE 1360 writes the packet to a particular virtual page of the DCN 1350. In some embodiments, this particular virtual page is mapped to a particular physical page of the guest machine (i.e., the VM 1350) by the guest operating system, while the particular physical page of the guest machine is mapped to a shared physical page of the host machine by the hypervisor 1330. Therefore, when the DCN-MFE 1360 writes the packet to the particular virtual page, the packet is written to the physical page 1340 of the host machine's physical memory 1345, which is shared with the destination VM 1355. That is, the destination VM 1355 also has a virtual page in its memory that is mapped (in the same manner described above) to the shared physical page 1340.
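The two-level mapping described above (guest virtual page to guest physical page by the guest operating system, then guest physical page to host physical page by the hypervisor) can be modeled with two lookup tables per guest. The following is a minimal sketch, assuming 4096-byte pages and writes that stay within one page; the `Guest` class, the page numbers, and the dictionary standing in for host memory are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

host_memory = {}  # (host physical page, offset) -> byte value


class Guest:
    def __init__(self, guest_page_table, nested_page_table):
        self.gpt = guest_page_table    # guest virtual page -> guest physical page
        self.npt = nested_page_table   # guest physical page -> host physical page

    def translate(self, guest_virtual_addr):
        """Resolve a guest virtual address to (host physical page, offset)."""
        vpage, offset = divmod(guest_virtual_addr, PAGE_SIZE)
        return self.npt[self.gpt[vpage]], offset


def write(guest, addr, data):
    """Store data at a guest virtual address (assumed not to cross a page)."""
    page, offset = guest.translate(addr)
    for i, byte in enumerate(data):
        host_memory[(page, offset + i)] = byte


def read(guest, addr, length):
    """Load length bytes from a guest virtual address."""
    page, offset = guest.translate(addr)
    return bytes(host_memory[(page, offset + i)] for i in range(length))


# Two guests with unrelated virtual and guest physical page numbers whose
# nested page tables both point at the same (arbitrary) host page 7.
vm1 = Guest({5: 12}, {12: 7})
vm2 = Guest({9: 3}, {3: 7})
```

With these tables, bytes that `vm1` writes at virtual page 5 are visible to `vm2` at virtual page 9, because both translations end at the same host physical page.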

The second stage 1310 shows that, after writing the packet to the shared memory space, the source DCN-MFE 1360 sends a notification 1380 to the hypervisor 1330, informing the hypervisor that a new packet is available in the shared memory. In some embodiments, the notification message sent to the hypervisor also informs the hypervisor that the new packet should be read by the destination DCN-MFE 1365 and not any other DCN that shares the same memory space. In some other embodiments, the memory space is only shared between these two particular DCNs and, as such, the notification message only notifies the hypervisor of the arrival of a new packet. Additionally, although the illustrated example shows that the source DCN-MFE notifies the hypervisor after writing only one packet to the physical page, it should be understood that in some embodiments, such notification is sent to the hypervisor after a particular number of packets are stored in the shared memory space. The second stage also shows that the packet 1370 is now stored in the physical page 1340 of the physical memory 1345 of the host machine 1325.

In some embodiments, the shared page of the host machine's physical memory is accessible, at any particular time, by only one DCN. That is, when the hypervisor passes the control of the shared memory page from the source DCN to the destination DCN (e.g., to read from the page), the shared page is no longer accessible by the source DCN. There could be different reasons for not allowing two DCNs to have control over the same memory page, e.g., to avoid concurrent transactions on the same page, to avoid the spread of an attack from a source DCN to a destination DCN, etc.

In some embodiments, when the hypervisor is notified of the arrival of new network traffic (i.e., one or more new packets) in the shared memory space, the hypervisor notifies the destination DCN-MFE to read the new network traffic from the shared memory. The third stage 1315 shows such a notification. Specifically, this stage shows that the hypervisor sends a notification 1390 to the DCN-MFE 1365, informing this forwarding element that a new packet 1370 is stored in the shared physical page 1340. In some embodiments, this notification is sent to the destination DCN-MFE 1365 each time the hypervisor is notified of the arrival of new traffic by the source DCN-MFE 1360. In some other embodiments, the hypervisor notifies the destination DCN-MFE 1365 when the hypervisor receives a particular number of notifications from the source DCN-MFE 1360. In yet other embodiments, the hypervisor notifies the destination DCN-MFE 1365 when a particular number of packets are stored in the shared physical page 1340.

The fourth stage 1320 shows that the destination DCN-MFE 1365 reads the new packet 1370 from the shared memory space after the destination DCN-MFE 1365 is notified by the hypervisor 1330. Even though the fourth stage shows that the packet 1370 is being sent from the physical memory 1345 to the destination DCN-MFE 1365, in reality, the destination DCN-MFE 1365 reads the packet from a particular virtual page of the DCN 1355. In some embodiments, this particular virtual page is mapped to a particular physical page of the guest machine (i.e., the VM 1355) by the guest operating system, while the particular physical page of the guest machine is mapped to the shared physical page of the host machine by the hypervisor 1330. Therefore, when the DCN-MFE 1365 reads the packet from the particular virtual page, the packet is in fact read from the physical page 1340 of the host machine's physical memory 1345, which is shared with the destination VM 1355.

In some embodiments, each time the destination DCN-MFE reads from the shared memory (i.e., one or more physical pages of the physical memory of the host machine), the destination DCN-MFE removes the network traffic that is read from the memory. Some embodiments free the shared physical pages from network traffic after a destination DCN-MFE has read from these shared physical pages a particular number of times. In yet other embodiments, the shared memory is cleaned up periodically. That is, the hypervisor deletes the packets from the shared memory spaces after a certain period of time lapses.
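The four stages of FIG. 13, together with the single-owner and remove-on-read behaviors of some embodiments, might be sketched as follows. This is an illustrative model only: the `SharedPage` and `Hypervisor` classes and their methods are hypothetical, the sketch notifies on every offload rather than after a threshold, and real access control would be enforced through page mappings rather than a Python check.

```python
class SharedPage:
    """A shared page that only its current owner may read or write."""

    def __init__(self, owner):
        self.owner = owner
        self.packets = []

    def _check(self, dcn):
        if dcn != self.owner:
            raise PermissionError(f"{dcn} does not control the shared page")

    def write(self, dcn, packet):
        self._check(dcn)
        self.packets.append(packet)

    def read_all(self, dcn):
        """Return the stored packets and remove them from the page."""
        self._check(dcn)
        packets, self.packets = self.packets, []
        return packets


class Hypervisor:
    def __init__(self, page):
        self.page = page
        self.notified = []  # DCNs told that new traffic is waiting

    def on_offload(self, source, destination):
        """Handle the source DCN-MFE's offload notification."""
        if self.page.owner != source:
            raise PermissionError(f"{source} does not control the shared page")
        self.page.owner = destination       # pass control of the page
        self.notified.append(destination)   # notify the destination DCN-MFE
```

A typical exchange would be `page.write("VM1", pkt)`, then `hv.on_offload("VM1", "VM2")`, then `page.read_all("VM2")`; a further write by VM1 is rejected until control of the page is passed back.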

II. Securing a DCN-MFE within a DCN

Since the DCN-MFE of some embodiments is instantiated (and operates) in a DCN (e.g., as one of the drivers of the DCN), the DCN-MFE is more vulnerable to malicious attacks in comparison with an MFE that is instantiated (and operates) in a hypervisor of a host machine. This is because, although the DCN-MFE is instantiated in the kernel of a guest operating system (e.g., in the network stack of the kernel), in some embodiments, the DCN-MFE is still exposed to other applications and processes that are run by the guest operating system. In contrast, an MFE that operates in the hypervisor of a host machine is solely controlled by the central control plane (i.e., the CCP cluster) of the hosting system and is not exposed to any outside applications and/or processes.

In order to protect the DCN-MFE from malicious attacks, some embodiments mark the pages that contain the code and data of the DCN-MFE (e.g., the memory space of the host machine on which the DCN-MFE's code and data are loaded) as read-only to the guest operating system. Some such embodiments only allow the hypervisor to write to the pages that are marked as read-only for the guest operating system. Although this approach protects the DCN-MFE from being modified by the guest operating system, a malicious module may still attack the DCN-MFE by loading onto the guest kernel and simulating the functionalities of the DCN-MFE. That is, a malicious module loads onto the guest kernel and communicates with the VNIC (of the DCN) or the PNIC (of the host machine) in the same way that the DCN-MFE does, hence exposing these interfaces to malicious attacks.

In addition to marking the memory as read-only memory, some embodiments check one or more particular data structures of the guest kernel (e.g., in the same manner as antivirus programs do) to ensure that the DCN-MFE is the only module that communicates with the PNIC and/or VNIC through a communication channel. Some such embodiments check the particular one or more data structures periodically, while other embodiments check the particular data structures when a certain number of packets are received at the PNIC and/or VNIC.

Marking the memory pages of the kernel that store the code and data of the DCN-MFE as read-only, however, is not enough to protect the DCN-MFE from kernel-level attacks or malicious programs, such as rootkits (e.g., malicious software that masks itself as being the DCN-MFE). The possible rootkit attacks include attempts to unload the DCN-MFE instance from the kernel of the DCN or prevent the DCN-MFE instance from loading. A rootkit attack may also include tampering with the DCN-MFE code or data that are in physical memory (e.g., of the virtual or host machine) and tampering with the communication channels of the DCN-MFE instance with other network elements such as the hypervisor and/or PNIC of the host machine.

Some embodiments protect the DCN-MFE from such malicious attacks by separating the memory space (e.g., in the host machine's physical memory) in which the code and data of the DCN-MFE are loaded (guest secure domain) from the memory space in which other applications and processes of the DCN are loaded (guest general domain). In some embodiments, the other applications and processes that are stored in the guest general memory space include the guest user space applications as well as the processes and drivers that are loaded in the guest kernel. Some embodiments store additional data and modules in the guest secure domain, in which the DCN-MFE is loaded, in order for the two guest domains to be able to communicate with each other in a secure manner. These additional data and modules that are stored in the guest secure domain are described below by reference to FIG. 15.

Conventionally, when a data compute node is loaded in a host machine (e.g., into the host machine's physical memory), the hypervisor of the host machine creates and uses a set of nested page tables (NPTs) to map the guest physical memory space of the DCN to a portion (i.e., a set of pages) of the host physical memory space. In order to separate the guest secure domain from the guest general domain, the hypervisor of some embodiments creates two sets of NPTs for each DCN that is loaded in the host machine (i.e., that starts operating on the host machine). In some such embodiments, the hypervisor creates a first set of NPTs (also referred to as secure NPTs) and a second set of NPTs (also referred to as general NPTs). The secure NPTs include a set of tables that maps the guest physical memory addresses that contain the DCN-MFE (code and data) to the guest secure domain. Similarly, the general NPTs include a set of tables that maps the guest physical memory addresses that contain other applications and processes to the guest general domain.

FIG. 14 conceptually illustrates a process 1400 that some embodiments perform in order to isolate a guest secure domain in the physical memory of a host machine for loading the code and data of a DCN-MFE of a data compute node. In some embodiments, the process 1400 is performed by a hypervisor of the host machine (e.g., by a virtual memory monitor (VMM) in the hypervisor). The process 1400 begins by receiving (at 1410) a notification that a DCN-MFE has been loaded on a data compute node. In some embodiments, the hypervisor is notified that a DCN-MFE is loaded in a guest VM when the VM is loaded in the host machine and the DCN-MFE is loaded in the guest physical memory of the VM. In some embodiments, the hypervisor (or the VMM in the hypervisor) also identifies the guest virtual addresses of the executable code and data regions of the DCN-MFE.

The process then authenticates (at 1420) the loaded DCN-MFE (i.e., the executable code and data of the DCN-MFE loaded into memory). In some embodiments, upon receiving the notification of loading the DCN-MFE, the hypervisor performs a signature verification to verify the authenticity of the DCN-MFE. The purpose of the signature verification is to ensure that the executable code and data of the DCN-MFE are the same as the original code and data of the DCN-MFE. In other words, the initial signature verification verifies that the code and data of the DCN-MFE have not been modified, or tampered with, by malicious software.

In order to perform signature verification, the hypervisor of some embodiments issues a request to a service appliance to validate the executable code and data regions of the DCN-MFE by comparing these regions against known valid signatures. Any technically feasible method may be used for the validation so long as the validation is orchestrated outside the guest, because a rogue agent may attempt to disable any such validation within the guest. For example, some embodiments decrypt a hashed valid version of the executable code and data regions of the DCN-MFE using a public key of the creator of the DCN-MFE, and compare the hashed valid version against the in-memory image of the executable code and data regions of the DCN-MFE. Some other embodiments use other methods.
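The comparison step of such a validation can be sketched as follows. This is a sketch under stated assumptions rather than the validation method of any embodiment: it assumes that the trusted digest has already been recovered from the creator's signature (the public-key step is omitted), it assumes SHA-256 as the hash function, and the function names are hypothetical. As noted above, such a check would run outside the guest.

```python
import hashlib
import hmac


def image_digest(code, data):
    """Hash the in-memory code and data regions of the DCN-MFE."""
    h = hashlib.sha256()
    # Length prefixes keep the code/data boundary unambiguous.
    h.update(len(code).to_bytes(8, "big"))
    h.update(code)
    h.update(len(data).to_bytes(8, "big"))
    h.update(data)
    return h.digest()


def verify_image(code, data, trusted_digest):
    """Return True when the in-memory image matches the trusted digest.

    trusted_digest is assumed to have been obtained from the creator's
    signature by a step that this sketch does not show.
    """
    return hmac.compare_digest(image_digest(code, data), trusted_digest)
```

A failed comparison corresponds to the case in which the code or data regions were modified after the valid version was produced.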

The process 1400 then determines (at 1430) whether the DCN-MFE has passed the authentication test. In some embodiments, the process treats (at 1440) the DCN containing the DCN-MFE as being under malicious attack when the signature verification operation for authenticating the executable code and data of the DCN-MFE fails. That is, the process terminates the DCN, or alternatively drops all the packets received from the DCN. On the other hand, when the process determines that the DCN-MFE is authenticated, the process creates (at 1450) a secure set of NPTs and a general set of NPTs.

As described above, the secure set of NPTs includes a set of tables that maps the guest physical memory addresses that contain the DCN-MFE (i.e., executable code and data of the DCN-MFE) to the guest secure domain in the physical memory of the host machine. Similarly, the general set of NPTs includes a set of tables that maps the guest physical memory addresses that contain other applications and processes to the guest general domain in the physical memory of the host machine.

In some embodiments, a hypervisor of the host machine creates a set of original NPTs to map the guest DCN's physical memory to the host machine's physical memory when the guest DCN is loaded into the memory of the host machine. In some such embodiments, the secure and general NPTs are created from the originally generated NPTs. In particular, in some embodiments, the mappings of guest physical memory addresses corresponding to the executable code and data regions of the DCN-MFE are moved, from the original NPTs, into the secure NPTs and the other mappings are moved into the general NPTs.
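Deriving the two sets of NPTs from the original set amounts to partitioning the original mappings, which may be sketched as follows. The flat dictionary standing in for the nested page tables and the `split_npts` function are hypothetical; actual nested page tables are multi-level structures maintained by the hypervisor.

```python
def split_npts(original_npt, mfe_guest_pages):
    """Partition guest-physical to host-physical mappings into two sets.

    original_npt:    {guest physical page: host physical page}
    mfe_guest_pages: guest physical pages holding DCN-MFE code and data
    """
    secure = {g: h for g, h in original_npt.items() if g in mfe_guest_pages}
    general = {g: h for g, h in original_npt.items() if g not in mfe_guest_pages}
    return secure, general
```

Every original mapping lands in exactly one of the two results, so a guest physical page that holds DCN-MFE code or data is never reachable through the general set.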

Next, the process 1400 stores (at 1460) the DCN-MFE and other related security data in the guest secure domain of the physical memory of the host machine in order to isolate the code and data of the DCN-MFE from other code and data of the DCN. Particularly, the process uses the created secure NPTs to map the physical memory of the guest VM that contains the DCN-MFE's code and data to the guest secure domain of the physical memory of the host machine.

The process then stores (at 1470) other applications and processes of the guest VM in the guest general domain. Particularly, the process uses the created general NPTs to map the physical memory of the guest VM that contains the other applications' code and data, as well as other processes, to the guest general domain of the physical memory of the host machine. In some embodiments, the other applications and processes that are stored in the guest general memory space include the guest user space applications as well as the processes and drivers that are loaded in the guest kernel.

The specific operations of the process 1400 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. For example, some embodiments, upon receiving a notification that the DCN-MFE is loaded, mark the page table entries of the memory locations that store the code and data of the DCN-MFE to be read-only, so that no other guest thread running on other virtual CPUs can modify the memory state of the executable code and data regions of the DCN-MFE.

Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. For example, in some embodiments, the operation 1450 that creates the secure and general NPTs uses a sub-process to create secure guest page tables from the original guest page tables and store them in the guest secure domain of the physical memory. In some such embodiments, the hypervisor moves the original guest page table entries that point to guest physical memory pages that are mapped by the secure NPTs into the guest secure domain. Creating the secure guest page tables and storing them in the guest secure domain prevents a rogue agent from changing the mapping of a guest virtual memory address which points to a guest physical memory address that is mapped in secure NPTs.

FIG. 15 illustrates a memory mapping system that some embodiments employ to isolate a guest secure domain from the guest general domain in the physical memory of the host machine. In some embodiments, there are two types of memory mappings that map the memory of a DCN to the memory of the host machine on which the DCN executes. The first type is a mapping from the guest virtual memory space to the guest physical memory space. This type of mapping is managed by the guest DCN operating system and is encoded in guest page tables. As in physical computer systems, the guest page tables of some embodiments are provided per process. The second type of memory mapping is a mapping from the guest physical memory space to the host physical memory space. This type of mapping is managed by the virtualization software of the host machine (e.g., the hypervisor), and is encoded in nested page tables (NPTs).

Conventionally, for each DCN that starts operating on the host machine (e.g., loaded into the memory of the host machine), one set of nested page tables is created that maps the memory of the DCN to the physical memory of the host machine. Some embodiments generate two sets of NPTs, as described above, from the original set of NPTs. These two sets include secure NPTs and general NPTs. Secure NPTs map guest physical memory addresses to a guest secure domain of the host physical memory while general NPTs map guest physical memory addresses to a guest general domain of the host physical memory.

FIG. 15 shows a DCN 1505, a hypervisor 1510, and a host physical memory 1520. The DCN 1505 includes a set of guest applications 1515 in the user space of the DCN, a DCN-MFE 1525 that operates in the kernel of the DCN, and a set of other processes 1535 that are also loaded in the kernel of the DCN. The hypervisor 1510 includes a set of secure NPTs 1540 and a set of general NPTs 1550. The host physical memory 1520 includes a guest secure domain 1580 and a guest general domain 1590. The guest secure domain contains the executable code 1555 and data 1560 of the DCN-MFE, a set of secure guest page tables 1565, and a switching module 1570. The guest general domain contains the other applications' code and data and other processes 1575 of the DCN.

This figure shows how the mappings of the two separate NPTs isolate the guest secure domain from the guest general domain, and thereby the DCN-MFE from the other applications and processes, in the host machine's physical memory. Particularly, the figure shows that the secure NPTs 1540 generated in the hypervisor 1510 of the host machine map the DCN-MFE 1525 to the guest secure domain 1580 of the physical memory. On the other hand, the general NPTs 1550 generated in the hypervisor 1510 of the host machine map the guest applications 1515 and the other kernel processes 1535 to the guest general domain 1590 of the physical memory.

Some embodiments, in addition to the executable code and data of the DCN-MFE, store a set of secure guest page tables 1565 and a switching module 1570 in the guest secure domain 1580. The secure guest page tables 1565 are created from the original guest page tables that contain the code and data of the DCN-MFE and include guest page table entries that point to guest physical memory pages that are mapped by the secure NPTs into the guest secure domain. As stated above, creating the secure guest page tables and storing them in the guest secure domain prevents a rogue agent from changing the mapping of a guest virtual memory address which points to a guest physical memory address that is mapped in secure NPTs.

In some embodiments, the data that is stored in the guest secure domain 1580 is not mapped in the guest general domain and is only accessible to code that executes in the guest secure domain such as the executable code of the DCN-MFE. As such, confidential information can be stored in the guest secure domain 1580 without any risk of being exposed even if the guest OS is compromised.

The switching module 1570 is deployed by some embodiments for switching between the guest secure domain and the guest general domain. That is, this module is called as a secure way to enter into or exit out of the guest secure domain. In some embodiments, the operating system of the guest DCN calls the switching module, which is stored in the guest secure domain but is mapped executable from the guest general domain, and causes the switching module to enter the guest secure domain (e.g., to pass the execution control from the guest general domain to the guest secure domain). On the other hand, in some embodiments, the DCN-MFE calls the switching module and causes this module to exit the guest secure domain (e.g., to pass the execution control from the guest secure domain to the guest general domain).

In order to employ this switching module, the hypervisor of the host machine first determines whether or not there is an available secure thread stack in a secure thread stack pool. If not, the hypervisor returns a busy status to the switching module. If there is an available secure thread stack or when a secure thread stack becomes available, the hypervisor selects a secure thread stack. Then, the hypervisor obtains the return address from the instruction pointer in the guest general domain and saves the return address in a temporary register.

The hypervisor then changes the pointer to the guest page tables (currently pointing to guest page tables stored in the guest general domain), and the NPT pointer, which is the pointer to the nested page tables (currently pointing to the general NPTs), so that the guest page table pointer points to guest page tables stored in the guest secure domain and the NPT pointer points to secure NPTs. After the page table pointers have been changed, the hypervisor switches the stack to the selected secure thread stack and pushes the return address in the temporary register onto the selected stack.

The hypervisor then sets the instruction pointer to an entry point to the secure protection domain and resumes guest execution, as a result of which execution control is transferred to a secure protection domain dispatcher. The secure protection domain dispatcher performs a validation of an entry number (passed as a parameter when the switching module was called) and if the entry number is validated, allows execution of the DCN-MFE in the secure protection domain. Validation of the entry number consists of a check that the entry number corresponds to a defined service. Dispatch can be done through a jump table, binary decision tree, or other mechanism that transfers control flow from the dispatch routine to the code associated with the indicated service.
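A jump-table dispatcher of the kind mentioned above might look like the following minimal sketch. The two services, their entry numbers, and the use of `PermissionError` to signal a rejected entry number are hypothetical choices for illustration; the specification does not name particular services.

```python
# Hypothetical services exposed by the secure domain (names are assumptions).
def forward_packet(argument):
    return f"forwarded:{argument}"


def read_statistics(argument):
    return f"statistics:{argument}"


# Jump table: entry number -> code associated with the indicated service.
JUMP_TABLE = {
    0: forward_packet,
    1: read_statistics,
}


def dispatch(entry_number, argument):
    """Validate the entry number, then transfer control to the matching service."""
    handler = JUMP_TABLE.get(entry_number)
    if handler is None:
        # The entry number does not correspond to a defined service.
        raise PermissionError(f"undefined secure-domain entry {entry_number}")
    return handler(argument)
```

Validation here is exactly the membership check: an entry number that is absent from the table never reaches any secure-domain code.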

When the DCN-MFE calls into the switching module, it causes the module to exit the secure domain. To do this, the hypervisor pops the return address from the stack and saves the return address in a temporary register. Then, the hypervisor changes the guest page table pointer and the NPT pointer, so that the guest page table pointer points to the general guest page tables and the NPT pointer points to general NPTs. After the page table pointers have been changed, the hypervisor switches the stack back to the thread stack in the guest general domain and returns the current secure thread stack back to the secure thread stack pool. The hypervisor then sets the instruction pointer to the return address stored in the temporary register and resumes guest execution in the guest general domain.
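The enter and exit sequences described above can be sketched as a small state machine. This is a simplified model under stated assumptions: the `VCpu` and `DomainSwitcher` classes, their field names, and the string labels "general" and "secure" are hypothetical stand-ins for the guest page table pointer, the NPT pointer, and the thread stacks that a hypervisor would actually manipulate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VCpu:
    """Hypothetical, simplified view of the guest state the hypervisor rewrites."""
    instruction_pointer: int
    guest_page_tables: str = "general"
    nested_page_tables: str = "general"
    stack: List[int] = field(default_factory=list)
    saved_general_stack: Optional[List[int]] = None


class DomainSwitcher:
    def __init__(self, pool_size: int, secure_entry_point: int):
        self.secure_stack_pool: List[List[int]] = [[] for _ in range(pool_size)]
        self.secure_entry_point = secure_entry_point

    def enter(self, vcpu: VCpu) -> str:
        if not self.secure_stack_pool:
            return "busy"                            # no secure thread stack available
        secure_stack = self.secure_stack_pool.pop()  # select a secure thread stack
        temp_register = vcpu.instruction_pointer     # save the return address
        vcpu.guest_page_tables = "secure"            # swap the guest page table pointer
        vcpu.nested_page_tables = "secure"           # swap the NPT pointer
        vcpu.saved_general_stack = vcpu.stack
        vcpu.stack = secure_stack                    # switch to the secure stack
        vcpu.stack.append(temp_register)             # push the return address
        vcpu.instruction_pointer = self.secure_entry_point
        return "entered"

    def exit(self, vcpu: VCpu) -> None:
        temp_register = vcpu.stack.pop()             # pop the return address
        vcpu.guest_page_tables = "general"
        vcpu.nested_page_tables = "general"
        secure_stack = vcpu.stack
        vcpu.stack = vcpu.saved_general_stack        # back to the general thread stack
        vcpu.saved_general_stack = None
        self.secure_stack_pool.append(secure_stack)  # return the stack to the pool
        vcpu.instruction_pointer = temp_register     # resume in the general domain
```

In this sketch the page table pointers are changed before the stack is switched, mirroring the order in the description, so the secure stack is only touched once the secure mappings are active.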

Instead of using a separate secure domain for the code and data of the DCN-MFE, some embodiments employ a counter check security agent in order to protect the DCN-MFE against malicious attacks. In some embodiments the counter check security agent operates in the virtualization software of the host machine. The counter check security agent of some embodiments receives a message from the DCN-MFE to increase a local counter value by n (n being an integer greater than or equal to one). The counter check security agent receives this message when the DCN-MFE transmits n packets (1) to a PNIC of the host machine directly (e.g., in the pass-through approach), or alternatively (2) to a VNIC of the DCN to be transmitted to the MFE of the virtualization software (e.g., in the emulation approach).

FIG. 16 conceptually illustrates a process 1600 that some embodiments perform to protect a DCN-MFE of a data compute node against malicious attacks by using a packet counter value. In some embodiments a counter check security agent that operates in the hypervisor of a host machine performs this process. The process 1600 begins by receiving (at 1610) a message from the DCN-MFE to increase the current value of a counter that is local to the security agent. In some embodiments the message instructs the security agent to increase the local value by n when the DCN-MFE transmits n packets out. In some embodiments the current value of the counter is stored in a physical memory of the host machine that is only accessible by the counter check security agent.

The process then increases (at 1620) the local counter's value by the number of packets that was included in the received counter increase message. For example, the security agent increments the value of the local counter by one when the security agent receives a counter increase message from the DCN-MFE indicating that only one packet was transmitted out from the DCN-MFE.

The process of some embodiments, after increasing the counter value, retrieves (at 1630) a packet counter value from the PNIC of the host machine (e.g., in the pass-through approach) and/or the VNIC of the DCN (e.g., in the emulation approach). The retrieved counter value, in some embodiments, shows the total number of packets received at the PNIC and/or VNIC. That is, in some embodiments, the PNIC and/or VNIC increases a packet counter value each time the interface receives a packet from the DCN-MFE (and/or any other module including a potential malicious module). In some other embodiments, the PNIC and/or VNIC increases the packet counter value each time the interface sends a packet to other network elements.

The process then determines (at 1640) whether the packet number value retrieved from the PNIC and/or VNIC is equal to the local counter value of the security agent. When the counter value kept in the local counter (after increasing the local counter value by n) of the counter check security agent is the same as the number retrieved from the PNIC and/or VNIC, the process determines that the DCN is in a normal condition (i.e., the DCN is not under any type of malicious attack) and ends.

On the other hand, if the two numbers (i.e., the local number and the retrieved number from the PNIC and/or VNIC) do not match, the process of some embodiments treats (at 1650) the DCN executing the DCN-MFE as being under malicious attack and notifies the virtualization software (e.g., the hypervisor) of such. The process then ends. In some embodiments, the virtualization software then takes the necessary steps to prevent the malicious attack from spreading (e.g., by dropping any additional packets received from the DCN, by terminating the DCN, etc.).
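The operations 1610-1650 of the process 1600 can be sketched as follows. This is a minimal model that assumes the interface counts every packet it receives; the `Nic` and `CounterCheckAgent` classes, their method names, and the returned status strings are hypothetical and stand in for the PNIC or VNIC packet counter and the security agent's local counter.

```python
class Nic:
    """Stand-in for a PNIC or VNIC that counts every packet it receives."""

    def __init__(self):
        self.packet_counter = 0

    def receive(self, packet):
        self.packet_counter += 1


class CounterCheckAgent:
    """Hypothetical counter check security agent running in the hypervisor."""

    def __init__(self, nic):
        self.nic = nic
        self.local_counter = 0  # assumed to be readable only by the agent

    def on_counter_increase(self, n):
        """Handle a counter increase message for n transmitted packets (1610)."""
        if n < 1:
            raise ValueError("n must be an integer greater than or equal to one")
        self.local_counter += n                # 1620: increase the local counter
        observed = self.nic.packet_counter     # 1630: retrieve the NIC's counter
        if observed == self.local_counter:     # 1640: compare the two values
            return "normal"
        return "under_attack"                  # 1650: caller notifies the hypervisor


nic = Nic()
agent = CounterCheckAgent(nic)

nic.receive("P1")                              # DCN-MFE sends one packet
first = agent.on_counter_increase(1)           # "normal"

nic.receive("P2")                              # DCN-MFE sends another packet
nic.receive("rogue")                           # an imitating module sends one too
second = agent.on_counter_increase(1)          # "under_attack"
```

The second check fails because the local counter holds 2 while the interface has counted 3 packets, which corresponds to the n+1 versus n+2 mismatch in the staged examples that follow.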

FIG. 17 illustrates an example of a counter check security agent operating in a hypervisor of a host machine that protects a DCN-MFE against a malicious attack in the pass-through approach. Specifically, this figure shows, through four different stages 1705-1720, how a counter check security (CCS) agent receives different counter values from a DCN-MFE and a PNIC and compares these values to identify a malicious attack on a DCN that runs the DCN-MFE. The figure shows a host machine 1725 that includes a PNIC 1730 and a hypervisor 1735. The host machine also executes a DCN (e.g., a virtual machine (VM)) 1740 that includes a DCN-MFE 1760 for forwarding processing. The hypervisor 1735 executes a counter check security agent 1750 for protecting the DCN-MFE 1760 against any potential malicious attack.

In the first stage 1705, the local counter value of the CCS agent 1750 in the hypervisor 1735 is n−1. That is, as of this moment, the CCS agent has received one or more messages from the DCN-MFE 1760 in which the DCN-MFE indicated to the CCS agent that the DCN-MFE has transmitted n−1 packets out to the PNIC 1730 so far. The first stage also shows that the PNIC 1730 includes a packet counter that currently has the same value of n−1, which shows this PNIC has received n−1 packets from the DCN-MFE 1760. In some embodiments, each DCN is associated with a single PNIC (in the pass-through approach) and therefore, the packet count that the PNIC's counter holds shows precisely the number of packets that the PNIC has received from the DCN.

In some embodiments, a PNIC can be a physical NIC of the host machine or a simulated PNIC from a set of simulated PNICs that are simulated from the physical NIC of the host machine. The first stage 1705 also shows that the DCN-MFE 1760 has transmitted (1) a packet 1765 (Pn) towards the PNIC 1730 and (2) a counter increase message 1770 towards the CCS agent 1750, which instructs the CCS agent to increase the local counter value by one.

The second stage 1710 shows that the CCS agent 1750 has received the message 1770 and as a result, has incremented the value of the local counter. As such, at this stage, the value of the local counter has changed from n−1 to n. Similarly, the PNIC 1730 has received the packet 1765 from the DCN-MFE 1760 and as a result, has incremented the value of the packet counter. Therefore, the value of the packet counter has changed from n−1 to n.

The third stage 1715 shows that the DCN-MFE 1760 has transmitted (1) another packet 1790 (Pn+1) towards the PNIC 1730 and (2) another counter increase message 1770 towards the CCS agent 1750, which instructs the CCS agent to increase the local counter value by one. Furthermore, the third stage shows that a malicious module 1780, which imitates the DCN-MFE 1760, is sending a packet to the PNIC 1730, e.g., through the same channel that the DCN-MFE uses to communicate with the PNIC. In some embodiments, the malicious module 1780 executes in the kernel of the VM (same as the DCN-MFE) and takes over some of the operations of the DCN-MFE, or alternatively, performs these operations in parallel with the DCN-MFE (as shown in the illustrated example).

The fourth stage 1720 shows that the CCS agent 1750 has received the second message 1770 and as a result, has incremented the value of the local counter. As such, at this stage, the value of the local counter has changed from n to n+1. However, the PNIC 1730 has received an extra packet 1785 from the malicious module 1780, in addition to the packet 1790 that it received from the DCN-MFE 1760. As a result, the PNIC 1730 has increased the value of the packet counter by two. Therefore, the value of the packet counter has changed from n to n+2. As described above, the CCS agent 1750 of some embodiments retrieves the packet counter value from the PNIC each time the CCS agent increases the local counter value. In some embodiments, the CCS agent retrieves the packet counter value by sending a request to the PNIC asking the PNIC to send the current packet counter value to the CCS agent.

At the fourth stage 1720, the CCS agent 1750, after receiving the packet counter value from the PNIC, compares this value with the value stored in the local counter associated with the CCS agent 1750. As described before, the local counter value is stored in a physical memory of the machine that is controlled by the hypervisor of the host machine in some embodiments. When the CCS agent 1750 compares the local counter value (n+1) with the packet counter value (n+2), the CCS agent realizes that the DCN is under attack because these two numbers do not match. The CCS agent 1750 of some embodiments, upon identifying a malicious attack, notifies the hypervisor of such. In some embodiments the hypervisor 1735 takes the necessary action to prevent the attack from spreading to other DCNs or other modules of the same DCN.

FIG. 18 illustrates another example of a counter check security agent operating in a hypervisor of a host machine that protects a DCN-MFE against a malicious attack in the emulation approach. Specifically, this figure shows, through four different stages 1805-1820, how a counter check security (CCS) agent receives different counter values from a DCN-MFE and a VNIC and compares these values to identify a malicious attack on a DCN that runs the DCN-MFE. The figure shows a DCN (e.g., a virtual machine (VM)) 1840 and a hypervisor 1835. The DCN 1840 includes a DCN-MFE 1860 for packet forwarding processing and a VNIC 1830 for forwarding the processed packets. The VNIC 1830 of some embodiments is a virtual network interface controller that is associated with a physical port of the MFE 1845 in order for the DCN 1840 to communicate with a managed forwarding element of a host machine such as the MFE 1845.

The hypervisor 1835 executes a counter check security agent 1850 for protecting the DCN-MFE 1860 against any potential malicious attack. The hypervisor 1835 also executes an MFE 1845 for packet forwarding processing for every DCN that (1) executes on the same host machine on which the MFE executes and (2) does not include a DCN-MFE to perform packet forwarding processing. The MFE 1845, in some embodiments, is also for receiving processed packets from every DCN that (1) executes on the same host machine on which the MFE executes and (2) includes a DCN-MFE that performs the forwarding processing on the packets but uses the MFE as an intermediary to send the processed packets to a PNIC (not shown) of the host machine in the emulation approach.

In the first stage 1805, the local counter value of the CCS agent 1850 in the hypervisor 1835 is n−1. That is, as of this moment, the CCS agent has received one or more messages from the DCN-MFE 1860 in which the DCN-MFE indicated to the CCS agent that the DCN-MFE has transmitted n−1 packets out to the VNIC 1830 so far. The first stage also shows that the VNIC 1830 includes a packet counter that currently has the same value of n−1, which shows this VNIC has received n−1 packets from the DCN-MFE 1860.

In some embodiments, each DCN is associated with a single VNIC and therefore, the packet count that the VNIC's counter holds shows precisely the number of packets that the VNIC has received from the DCN. The first stage 1805 also shows that the DCN-MFE 1860 has transmitted (1) a packet 1865 (Pn) towards the VNIC 1830 and (2) a counter increase message 1870 towards the CCS agent 1850, which instructs the CCS agent to increase the local counter value by one.

The second stage 1810 shows that the CCS agent 1850 has received the message 1870 and as a result, has incremented the value of the local counter. As such, at this stage, the value of the local counter has changed from n−1 to n. Similarly, the VNIC 1830 has received the packet 1865 from the DCN-MFE 1860 and as a result, has incremented the value of the packet counter. Therefore, the value of the packet counter has changed from n−1 to n.

The third stage 1815 shows that the DCN-MFE 1860 has transmitted (1) another packet 1890 (Pn+1) towards the VNIC 1830 and (2) another counter increase message 1870 towards the CCS agent 1850, which instructs the CCS agent to increase the local counter value by one. Furthermore, the third stage shows that a malicious module 1880, which imitates the DCN-MFE 1860, is sending a packet to the VNIC 1830, e.g., through the same communication channel that the DCN-MFE uses to communicate with the VNIC.

The fourth stage 1820 shows that the CCS agent 1850 has received the second message 1870 and as a result, has incremented the value of the local counter. As such, at this stage, the value of the local counter has changed from n to n+1. However, the VNIC 1830 has received an extra packet 1885 from the malicious module 1880, in addition to the packet 1890 that it received from the DCN-MFE 1860. As a result, the VNIC 1830 has increased the value of the packet counter by two. Therefore, the value of the packet counter has changed from n to n+2. As described above, the CCS agent 1850 of some embodiments retrieves the packet counter value from the VNIC each time the CCS agent increases the local counter value. In some embodiments, the CCS agent retrieves the packet counter value by sending a request to the VNIC asking the VNIC to send the current packet counter value to the CCS agent.

At the fourth stage 1820, the CCS agent 1850, after receiving the packet counter value from the VNIC 1830, compares this value with the value stored in the local counter associated with the CCS agent 1850. As described before, the local counter value is stored in a physical memory of the machine that is controlled by the hypervisor of the host machine in some embodiments. When the CCS agent 1850 compares the local counter value (n+1) with the packet counter value (n+2), the CCS agent realizes that the DCN is under attack because these two numbers do not match. The CCS agent 1850 of some embodiments, upon identifying a malicious attack, notifies the hypervisor of such. In some embodiments the hypervisor 1835 takes the necessary action to prevent the attack from spreading to other DCNs or other modules of the same DCN.

In some embodiments, a determined malicious module that simulates the DCN-MFE in the guest kernel may also imitate the communication between the DCN-MFE and the counter check security agent. In some embodiments, the DCN-MFE and the counter check security agent communicate with each other through a channel that is essentially a software function. The malicious module, in some such embodiments, may call the same counter increase function that the DCN-MFE calls. By calling the same function, the malicious module also sends a counter increase message to the security agent to increase the local counter by n, each time the malicious module transmits n packets to the PNIC and/or VNIC. In other words, the malicious module imitates both functions of the DCN-MFE to transmit the packets out to the PNIC and/or VNIC, and to send a counter increase message to the counter check security agent with each transmission.

In order to protect the DCN-MFE against this type of malicious module, the hypervisor of some embodiments generates a list of valid return addresses, each of which indicates a valid return address of a subsequent instruction after the last instruction of the counter increase function is executed. That is, each return address in the list of valid return addresses contains a memory address that a subsequent executing instruction pointer may point to after the last instruction of the counter increase function called by the DCN-MFE is executed. Additionally, each time any module (e.g., a DCN-MFE or a malicious module) calls the counter increase function, that module stores the return address of the next instruction, which has to be executed after the counter increase function returns, in a call stack.

In some embodiments, each time a counter increase message is received, the counter check security agent checks the call stack maintained by the DCN, which contains the return address after the counter increase function is finished. In some other embodiments a different security agent (other than the counter check security agent) that runs in the communication channel between the DCN-MFE and the hypervisor (e.g., inside the hypervisor or the DCN) checks the call stack. The security agent then matches the return address in the call stack of the DCN against the list of valid return addresses (that are kept in local storage of the hypervisor or DCN). When no match is found, the security agent determines that a separate module (which has a different return address for the subsequent instruction) has called into the counter increase function and notifies the virtualization software of a potential malicious attack on the DCN.

FIG. 19 conceptually illustrates a process 1900 of some embodiments that protects a DCN-MFE of a data compute node against a malicious module that imitates the DCN-MFE in sending counter increase messages to the counter check security agent. In some embodiments, the process 1900 is performed by a counter check security (CCS) agent that operates in the hypervisor of a host machine. In some other embodiments, a different security agent that also operates in the hypervisor of the host machine, along with the counter check security agent, performs the process 1900.

The process 1900 begins by receiving (at 1910) a message from the DCN-MFE to increase the current value of a counter that is local to the CCS agent. In some embodiments the message instructs the CCS agent to increase the local value by n when the DCN-MFE transmits n packets out. In some embodiments the current value of the counter is stored in a physical memory of the host machine that is only accessible by the counter check security agent (through a hypervisor that executes the security agent).

The process of some embodiments, upon receiving the message and before increasing the local counter value, retrieves (at 1920) the last return address from a call stack that is maintained by the DCN. The process of some embodiments retrieves the return address by querying a call stack data storage that is maintained by the data compute node. The call stack data storage, as described above, contains the return address after the counter increase function is finished. A module (or program) has several lines of instructions. When an instruction line calls a function to perform an operation, the function has to know the return address in order to pass the control of the processor to the instruction that is after the instruction that called the function. That is, the return address contains a pointer to the address of a subsequent instruction line in the module (or program) that is after the instruction line that calls the function.

As such, when a DCN-MFE (or a malicious module) calls the counter increase function, the return address of a subsequent instruction line of a program of the DCN-MFE (or the malicious module) from which the counter increase function is called, is stored in the call stack (e.g., by the operating system or by the malicious module). Therefore, the process retrieves, from the call stack, the return address that belongs to an instruction of a program that is being executed by the DCN-MFE or by a malicious module that is imitating the DCN-MFE.

In order to ensure that the counter increase function has been called by the DCN-MFE (i.e., the counter increase message has been received from the DCN-MFE), the process of some embodiments matches (at 1930) the retrieved return address against a data storage that contains a list of valid return addresses. In some embodiments, each return address in the list contains a return address of a subsequent instruction after the last instruction of the counter increase function is executed. That is, each return address in the list of valid return addresses contains a memory address that a subsequent instruction pointer of the processor may point to, after the last instruction of the counter increase function is executed. In some embodiments, each time a DCN-MFE is loaded into the memory of the host machine, the security agent running in the hypervisor of the host machine (or another module running in the hypervisor) generates the list of valid return addresses and stores the list in a data storage controlled by the hypervisor.

When the process finds a match, the process realizes that the counter increase message has been received from a legitimate DCN-MFE. The process then ends. On the other hand, when the process does not find a match in the list of valid return addresses, the process determines (at 1950) that a malicious module (which has a different return address for the subsequent instruction) has called the counter increase function and as such treats the DCN as being under malicious attack. That is, in some embodiments, the process notifies the hypervisor of a potential malicious attack on the DCN. In some such embodiments, the hypervisor takes the necessary actions (e.g., terminates the DCN, drops all the packets received from the DCN, etc.).
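The return address check of the process 1900 can be sketched as follows. This is a minimal model under stated assumptions: the call stack is represented as a list whose last element is the most recently stored return address, and the addresses, the function names, and the returned status strings are hypothetical values chosen for illustration.

```python
from typing import FrozenSet, Sequence

# Hypothetical list generated by the hypervisor when the DCN-MFE is loaded:
# addresses of the instructions that follow each legitimate call site.
VALID_RETURN_ADDRESSES: FrozenSet[int] = frozenset({0x7F001010, 0x7F001058})


def caller_is_valid(call_stack: Sequence[int],
                    valid_return_addresses: FrozenSet[int]) -> bool:
    """Match the last return address on the call stack against the valid list."""
    if not call_stack:
        return False                      # nothing to verify: treat as not valid
    last_return_address = call_stack[-1]  # 1920: retrieve the last return address
    return last_return_address in valid_return_addresses  # 1930: match


def handle_counter_increase(call_stack: Sequence[int],
                            valid_return_addresses: FrozenSet[int]) -> str:
    """Decide how to treat a received counter increase message."""
    if caller_is_valid(call_stack, valid_return_addresses):
        return "legitimate"               # message came from the DCN-MFE
    return "under_attack"                 # caller notifies the hypervisor


# A call from inside the DCN-MFE code versus a call from an imitating module.
from_mfe = handle_counter_increase([0x7F000F00, 0x7F001010], VALID_RETURN_ADDRESSES)
from_rogue = handle_counter_increase([0x7F000F00, 0x66002A40], VALID_RETURN_ADDRESSES)
```

A set is used for the valid list here so that the membership test does not depend on the number of legitimate call sites; the specification only requires that the list be kept in storage controlled by the hypervisor.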

The specific operations of the process 1900 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process 1900 could be implemented using several sub-processes, or as part of a larger macro process.

FIG. 20 illustrates an example of a security agent operating in a hypervisor of a host machine along with a counter check security agent in order to protect a DCN-MFE against a malicious attack in the pass-through approach. This figure shows a host machine 2010 that includes a PNIC 2030 and a hypervisor 2035. The host machine 2010 also executes a DCN (virtual machine) 2040. The DCN 2040 includes a DCN-MFE 2060 for packet forwarding processing, a call stack storage 2070, and a malicious module 2080 that imitates the DCN-MFE 2060. The hypervisor 2035 includes a security agent 2020, a CCS agent 2050, and a valid return address storage 2075. The PNIC 2030 of some embodiments is a physical network interface controller that exchanges the network traffic between the host machine 2010 and other host machines and/or network elements (e.g., physical external networks, etc.).

The hypervisor 2035 executes the counter check security agent 2050 for protecting the DCN-MFE 2060 against any potential malicious attack in the manner described above by reference to FIGS. 16-18. However, some determined malicious modules similar to the malicious module 2080 can imitate the counter increase message operation of the DCN-MFE. That is, the malicious module 2080 simulates both of the DCN-MFE's operations of sending a packet through a communication channel to the PNIC 2030, and sending a counter increase message to the CCS agent 2050, each time the malicious module sends a packet to the PNIC.

As such, the CCS agent 2050 alone is not able to protect the DCN-MFE against these types of modules. In order to protect the DCN-MFE 2060 against a module such as the malicious module 2080, some embodiments employ a security agent 2020 that operates on the communication channel between the CCS agent 2050 and the DCN 2040. As described above though, in some embodiments, the CCS agent 2050 itself performs the operations of the security agent 2020 in addition to the operations that were described above for the CCS agent.

The hypervisor 2035 also populates and maintains a valid return addresses data storage 2075 that contains a list of valid return addresses for the communication channel between the CCS agent 2050 and the DCN-MFE 2060. In some embodiments, each return address that is stored in the data storage 2075 includes a valid return address of a subsequent instruction after the last instruction of the counter increase function is executed. That is, each return address in the list of valid return addresses contains a memory address that a subsequent instruction pointer may point to, after the last instruction of the counter increase function that is called by the DCN-MFE is executed. In some embodiments, each time a DCN-MFE is loaded into the memory of the host machine, the hypervisor (or the security agent 2020 inside the hypervisor) populates the data storage 2075 with the list of valid return addresses.

The DCN 2040 also maintains a call stack data storage 2070 that contains the next return address that the processor should execute after a function call such as the counter increase function call returns. As described above, the executable code of a DCN-MFE includes several lines of programming instructions. When an instruction line in the code calls a function (e.g., counter increase function), the DCN-MFE has to know the address of the following instruction after the function call instruction in order to execute the next instruction after the function is executed. That is, the return address contains a pointer to the address of a subsequent instruction line in the executable code that is after the instruction line that calls the function. The DCN-MFE 2060, or any other executable code that executes in the DCN 2040, uses the call stack data storage 2070 to save the address of the following instruction before the control process switches from the DCN-MFE to the function that is being called.

As illustrated in the example figure, the malicious module 2080 has transmitted a packet 2085 out towards the PNIC 2030, and at the same time, the module has transmitted a message 2090 out to the CCS agent 2050 that instructs the agent to increase its local counter value. However, before the counter increase message reaches the CCS agent 2050, the security agent 2020 operating on the communication channel intercepts the message. In order to ensure that the counter increase function has been called by the DCN-MFE (i.e., the counter increase message has been received from the DCN-MFE), the security agent 2020 queries the call stack storage 2070 to receive the last return address that is stored in this storage.

The security agent 2020 then matches the received return address against the list of valid return addresses that are stored in the data storage 2075. In the illustrated example, since the counter increase function has been called by the malicious module 2080, the last return address that the security agent 2020 receives from the call stack storage 2070 is an address pointing to an instruction inside the malicious module code. As such, this address does not exist in the list of valid return addresses that point to addresses inside the DCN-MFE code. Consequently, the hypervisor 2035 detects a malicious attack on the DCN 2040 (i.e., on the DCN-MFE 2060 running in the DCN) and takes the required steps to protect the DCN against the malicious attack.

In some embodiments, the security agent 2020 uses multiple checkpoints to ensure that a malicious function is not imitating the valid return address method. That is, in some embodiments, the security agent 2020 checks the call stack in multiple points (e.g., when the counter increase function is called, at the end of the function, etc.), to ensure that a determined malicious module has not used the call stack to store a fake valid return address in the stack. Although FIG. 20 illustrates an example of a security agent 2020 that protects the DCN-MFE against malicious attacks in a pass-through approach, one of ordinary skill in the art would realize that the security agent 2020 protects the DCN-MFE in an emulation approach in the same manner that is described above.

III. Electronic System

Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer readable storage medium (also referred to as computer readable medium). When these instructions are executed by one or more computational or processing unit(s) (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer readable media include, but are not limited to, CD-ROMs, flash drives, random access memory (RAM) chips, hard drives, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), etc. The computer readable media do not include carrier waves and electronic signals passing wirelessly or over wired connections.

In this specification, the term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the invention. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.

FIG. 21 conceptually illustrates an electronic system 2100 with which some embodiments of the invention are implemented. The electronic system 2100 may be a computer (e.g., a desktop computer, personal computer, tablet computer, etc.), server, dedicated switch, phone, PDA, or any other sort of electronic or computing device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. Electronic system 2100 includes a bus 2105, processing unit(s) 2110, a system memory 2125, a read-only memory 2130, a permanent storage device 2135, input devices 2140, and output devices 2145.

The bus 2105 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 2100. For instance, the bus 2105 communicatively connects the processing unit(s) 2110 with the read-only memory 2130, the system memory 2125, and the permanent storage device 2135.

From these various memory units, the processing unit(s) 2110 retrieves instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.

The read-only memory (ROM) 2130 stores static data and instructions that are needed by the processing unit(s) 2110 and other modules of the electronic system. The permanent storage device 2135, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 2100 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 2135.

Other embodiments use a removable storage device (such as a floppy disk, flash memory device, etc., and its corresponding drive) as the permanent storage device. Like the permanent storage device 2135, the system memory 2125 is a read-and-write memory device. However, unlike storage device 2135, the system memory 2125 is a volatile read-and-write memory, such as a random access memory. The system memory 2125 stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 2125, the permanent storage device 2135, and/or the read-only memory 2130. From these various memory units, the processing unit(s) 2110 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 2105 also connects to the input and output devices 2140 and 2145. The input devices 2140 enable the user to communicate information and select commands to the electronic system. The input devices 2140 include alphanumeric keyboards and pointing devices (also called “cursor control devices”), cameras (e.g., webcams), microphones or similar devices for receiving voice commands, etc. The output devices 2145 display images generated by the electronic system or otherwise output data. The output devices 2145 include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD), as well as speakers or similar audio output devices. Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 21, bus 2105 also couples electronic system 2100 to a network 2165 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 2100 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessor or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself. In addition, some embodiments execute software stored in programmable logic devices (PLDs), ROM, or RAM devices.

As used in this specification and any claims of this application, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” mean displaying on an electronic device. As used in this specification and any claims of this application, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.

This specification refers throughout to computational and network environments that include virtual machines (VMs). However, virtual machines are merely one example of data compute nodes (DCNs) or data compute end nodes, also referred to as addressable nodes. DCNs may include non-virtualized physical hosts, virtual machines, containers that run on top of a host operating system without the need for a hypervisor or separate operating system, and hypervisor kernel network interface modules.

VMs, in some embodiments, operate with their own guest operating systems on a host using resources of the host virtualized by virtualization software (e.g., a hypervisor, virtual machine monitor, etc.). The tenant (i.e., the owner of the VM) can choose which applications to operate on top of the guest operating system. Some containers, on the other hand, are constructs that run on top of a host operating system without the need for a hypervisor or separate guest operating system. In some embodiments, the host operating system uses name spaces to isolate the containers from each other and therefore provides operating-system level segregation of the different groups of applications that operate within different containers. This segregation is akin to the VM segregation that is offered in hypervisor-virtualized environments that virtualize system hardware, and thus can be viewed as a form of virtualization that isolates different groups of applications that operate in different containers. Such containers are more lightweight than VMs.

A hypervisor kernel network interface module, in some embodiments, is a non-VM DCN that includes a network stack with a hypervisor kernel network interface and receive/transmit threads. One example of a hypervisor kernel network interface module is the vmknic module that is part of the ESXi™ hypervisor of VMware, Inc.

It should be understood that while the specification refers to VMs, the examples given could be any type of DCNs, including physical hosts, VMs, non-VM containers, and hypervisor kernel network interface modules. In fact, the example networks could include combinations of different types of DCNs in some embodiments.

Additionally, the term “packet” is used throughout this application to refer to a collection of bits in a particular format sent across a network. It should be understood that the term “packet” may be used herein to refer to various formatted collections of bits that may be sent across a network. A few examples of such formatted collections of bits are Ethernet frames, TCP segments, UDP datagrams, IP packets, etc.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIGS. 3, 11, 14, 16, and 19) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

We claim:
 1. A method comprising: at a first managed forwarding element executing within a first data compute node (DCN) of a plurality of DCNs that operate on virtualization software executing on a first host computer: from an application also executing within the first DCN, receiving a packet destined for a second DCN (i) that is logically connected to the first DCN through a set of logical forwarding elements of a logical network and (ii) that operates on virtualization software executing on a second host computer; performing forwarding processing on the packet (i) to identify a particular logical forwarding element in the set of logical forwarding elements, a logical port of which is coupled to the second DCN, and (ii) to identify a second managed forwarding element that implements the logical port of the particular logical forwarding element; and forwarding the packet to the second managed forwarding element.
 2. The method of claim 1, wherein the second managed forwarding element executes within the second DCN.
 3. The method of claim 1, wherein the second managed forwarding element executes within the virtualization software that executes on the second host computer that hosts the second DCN.
 4. The method of claim 3, wherein forwarding the packet to the second managed forwarding element comprises forwarding the packet to a third managed forwarding element that executes within the virtualization software executing on the first host computer to be subsequently forwarded to the second managed forwarding element.
 5. The method of claim 4, wherein the second managed forwarding element, upon receiving the packet from the third managed forwarding element, forwards the packet to a fourth managed forwarding element that executes within the second DCN.
 6. The method of claim 3, wherein the second host computer also hosts a third DCN that is logically connected to the first and second DCNs through the set of logical forwarding elements, wherein the second managed forwarding element is also connected to a fourth managed forwarding element that executes within the third DCN in order to exchange logical network data packets with the third DCN.
 7. The method of claim 1, wherein performing the forwarding processing on the packet comprises executing a pipeline for each one of the set of logical forwarding elements.
 8. The method of claim 1 further comprising, before forwarding the packet, using a particular tunnel protocol to encapsulate the packet with tunnel endpoint addresses of the first and second managed forwarding elements as source and destination tunnel endpoint addresses, respectively.
 9. The method of claim 8, wherein forwarding the packet to the second managed forwarding element comprises forwarding the encapsulated packet to a physical network interface controller (PNIC) of the first host machine to be subsequently forwarded to a PNIC of the second host machine.
 10. The method of claim 9, wherein the second managed forwarding element executes within the second DCN, receives the encapsulated packet from the PNIC of the second host machine, and decapsulates the packet by removing the tunnel endpoint addresses from the packet according to the particular tunnel protocol.
 11. The method of claim 1, wherein the logical port is a first logical port of the particular logical forwarding element, wherein the first DCN is also coupled to the particular logical forwarding element through a second logical port of the particular logical forwarding element.
 12. The method of claim 1, wherein the particular logical forwarding element is a first logical forwarding element, wherein the first DCN is coupled to a second, different logical forwarding element in the set of logical forwarding elements that is connected to the first logical forwarding element through a third logical forwarding element.
 13. The method of claim 12, wherein the first and second logical forwarding elements are logical switches, wherein the third logical forwarding element is a logical router, wherein both of the first and second managed forwarding elements implement all of the first, second, and third logical forwarding elements.
 14. A non-transitory machine readable medium storing a first managed forwarding element (MFE) program executable by at least one processing unit of a first host computer, the first MFE program executing within a first data compute node (DCN) of a plurality of DCNs that operate on virtualization software executing on the first host computer to perform forwarding processing for the first DCN, the first MFE program comprising sets of instructions for: from an application also executing within the first DCN, receiving a packet destined for a second DCN (i) that is logically connected to the first DCN through a set of logical forwarding elements of a logical network and (ii) that operates on virtualization software executing on a second host computer; performing forwarding processing on the packet (i) to identify a particular logical forwarding element in the set of logical forwarding elements, a logical port of which is coupled to the second DCN, and (ii) to identify a second managed forwarding element that implements the logical port of the particular logical forwarding element; and forwarding the packet to the second managed forwarding element.
 15. The non-transitory machine readable medium of claim 14, wherein the first MFE program further comprises a set of instructions for receiving data that defines forwarding behaviors of the first MFE program from a controller application in order to implement the set of logical forwarding elements, wherein the controller application also executes on the host computer.
 16. The non-transitory machine readable medium of claim 15, wherein the controller application (i) receives data that defines forwarding behavior of the set of logical forwarding elements, and (ii) based on the received data, generates the data that defines forwarding behaviors of the first MFE program.
 17. The non-transitory machine readable medium of claim 15, wherein the controller application further (i) generates data that defines forwarding behaviors of a third MFE that executes within the virtualization software executing on the first host machine, and (ii) based on the generated data that defines the forwarding behaviors of the third MFE, configures the third MFE to implement the set of logical forwarding elements.
 18. The non-transitory machine readable medium of claim 17, wherein the set of instructions for forwarding the packet comprises sets of instructions for: determining whether a physical network interface controller (PNIC) of the host machine is available to transmit the packet; and forwarding the packet to the third MFE when the PNIC is not available.
 19. The non-transitory machine readable medium of claim 18, wherein the set of instructions for forwarding the packet further comprises a set of instructions for forwarding the packet to the PNIC when the PNIC is available.
 20. The non-transitory machine readable medium of claim 18, wherein the set of instructions for determining whether the PNIC is available comprises a set of instructions for determining whether the first DCN executes a PNIC driver to communicate directly with the PNIC.
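
For illustration only, the tunnel encapsulation of claim 8 and the egress choice of claims 18 to 20 can be sketched as follows. The sketch assumes that tunnel endpoint addresses are plain strings and that PNIC availability reduces to whether the first DCN runs a PNIC driver, as claim 20 suggests. Every type, function name, and address below is hypothetical, and no particular tunnel protocol is implied.

```python
from dataclasses import dataclass
from enum import Enum


class NextHop(Enum):
    PNIC = "pnic"                      # transmit directly through the PNIC
    HYPERVISOR_MFE = "hypervisor_mfe"  # hand off to the MFE in the virtualization software


@dataclass(frozen=True)
class EncapsulatedPacket:
    outer_src: str   # tunnel endpoint address of the first MFE
    outer_dst: str   # tunnel endpoint address of the second MFE
    payload: bytes   # original packet, unchanged


def encapsulate(payload: bytes, src_endpoint: str, dst_endpoint: str) -> EncapsulatedPacket:
    """Wrap the packet with source and destination tunnel endpoint addresses."""
    return EncapsulatedPacket(src_endpoint, dst_endpoint, payload)


def select_next_hop(dcn_has_pnic_driver: bool) -> NextHop:
    """Use the PNIC when the DCN can drive it directly, otherwise the hypervisor MFE."""
    return NextHop.PNIC if dcn_has_pnic_driver else NextHop.HYPERVISOR_MFE


# Hypothetical endpoint addresses for the first and second MFEs.
pkt = encapsulate(b"data", "10.0.0.1", "10.0.0.2")
print(pkt.outer_dst, select_next_hop(True), select_next_hop(False))
```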